“It is always well to retain a clear geometric view of
the facts when we are dealing with statistical problems,
which abound with dangerous pitfalls, easily
overlooked by the unwary, while they are cantering
gaily along upon their arithmetic.”


Francis Galton. 




PREFACE

Much more than a trifle of truth is embodied in a simile drawn at the
expense of many research workers who prop their argument with statistical
proof. Likened to inebriated gentlemen affectionately clinging to
lamp-posts, they are said to regard statistics as a source of support rather
than as a center of illumination. One need not emphasize here the all too
frequent appropriateness of the criticism; rather, one is concerned with
the frank admission in the criticism that statistics do form a source of
illumination for those whose wits are not dimmed by stupor. Those
whose intellectual activities find stimulus in the attainment of rigorous
analysis turn toward statistical reasoning for illumination in the study of
variation systems.

The steadily increasing acceptance of statistical methodology as a
desirable tool in scientific analysis is reflected in the widening provision
made in university curricula for the teaching of the subject. It has been
the privilege of the author to be engaged in such instruction at the
University of Minnesota for the past ten years. Built on foundations
laid by Professor J. Arthur Harris, the exposition given in this book
reflects the experience acquired during that period. The author’s classes
have been composed principally of graduate students with a great diversity
of objectives. The urge to secure an understanding of the principles
of statistical reasoning has, in these classes, made companions of entomologist
and educator, of clinician and chemical engineer, of sociologist
and mathematician, as much as of anatomist and geneticist.

An interest in the cultivation of logical analysis in science is all that 
has been asked as a prerequisite. Uncommon in these times is the student 
of biological phenomena who has been instilled with a keen appreciation 
of the analytical power in quantitative logic which a knowledge of 
mathematics may open to him; rare indeed is the one whose fortune it has 
been to acquire that power from formal courses in pure mathematics. 
Nevertheless, some aptitude for logical analysis in quantitative thinking 
is innate to most students who choose science as a prospective profession. 
A course in statistics may well be arranged to develop that aptitude 
through appeal to realities, whereas a prerequisite requirement of further 
training in theoretical mathematics oft-times conjures fears that effec- 
tively block further progress. Experience has amply shown that such 
proficiency in mathematics as is finally demanded of the sound practical
statistician may usually be developed quite well from but humble beginnings,
and without recourse to current mathematical pedagogy.

To maintain the interest of the non-mathematically trained student, 
and to assist him in visualizing the nature of the problems under analysis,
the method of geometric portrayal of situations by means of diagrams is 
freely employed. On that foundation, algebraic derivations of a simple 
nature may be developed without substantial loss of enthusiasm. 
Throughout the process the author has discerned from his classes that an 
appreciation of mathematics is inculcated, and some mastery of its 
reasoning processes is acquired. 

Those who seek a compendium of statistical techniques, an array of 
formulas from which one or another may be culled to act as a mill for 
grinding fine flour from crude grain, merely by turning a handle, should 
not look further through these pages. What is written herein is intended 
for those who wish to reason carefully, not merely imitate. A sequence of 
such general concepts as seem to the author to be foundational is accord- 
ingly developed. From such a beginning it is believed that one may work 
forward to comprehension of the details of specialized procedures. 

An original plan to include a critical analysis of small-sample tech- 
niques was finally laid aside in order that the discussion of elementary 
principles should not be too greatly restricted by the demands of space 
economy. The author believes that unfortunate consequences flow all 
too readily from an approach to statistical reasoning via the “small
sample.” It is not in the contemplation of meager sets of information
that one most readily grasps general principles underlying the analysis
of variation systems. If circumstances warrant it and time permits, the
author will essay to complete the original plan in another volume.

The inclusion of an array of special exercises in this book has been 
deemed unnecessary. Only in its purely mathematical aspect does the 
statistical technique assume generality. Every set of factual data in
science that may be assembled for statistical analysis presents questions 
which are intimately concerned with the validity of the data themselves. 
Misinterpretation all too readily flows from routine application of sta- 
tistical procedures. Full understanding of the limitations of specific sets 
of data may usually be imparted only through lengthy discussion with 
those not familiar with the field. The author trusts that the reader will 
in general be able to provide himself with data in which he is interested 
and with the probable limitations of which he is reasonably familiar. In- 
structors may advisedly accumulate such material for practical assign- 
ments to meet the special interests of their classes. The reader who is 
restricted to the text as a source will, however, find opportunity to test 
his grasp of procedure by using the various sets of data found there. It
will be noted that, for the most part, these are studied with respect to
one feature only, leaving ample room for other analyses. 

It is very desirable that verification of all computations made in 
statistical work should be undertaken by the analyst himself. Fre- 
quently the reasonableness of the result of a calculation may be judged 
by careful inspection of the original data, coupled with some rough test 
by mental arithmetic. Such apparent concordance is, of course, not 
completely satisfactory in itself, but it will serve to indicate the presence 
of such gross arithmetical errors as all are subject to making in varying 
degree. Final verification through using a different arithmetic, such as a 
change of code scale or transfer to an alternative formula provides, is 
urged. The ease with which an error may be repeated in doing the same 
calculation over again before it has been completely forgotten is amazing. 
The statistical worker must learn to become self-reliant with respect to 
his arithmetic. Independent verification of results furnishes his most 
satisfactory attack on this problem. 
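
The kind of independent verification urged above may be sketched in modern terms as follows. This is only a minimal illustration in Python, not a procedure taken from the text: the function names, the working origin, the coding unit, and the stature values are all assumptions invented for the purpose. The mean and standard deviation are found once directly and once by a different arithmetic, namely a change of code scale combined with an alternative formula for the variance, and the two results are compared.

```python
import math

def mean_sd_direct(values):
    """Mean and standard deviation from the raw values (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)

def mean_sd_coded(values, origin, unit):
    """The same quantities by a different arithmetic: code, compute, decode."""
    n = len(values)
    coded = [(v - origin) / unit for v in values]   # u = (x - origin) / unit
    mean_u = sum(coded) / n
    # Alternative formula: mean of squares minus square of the mean.
    var_u = sum(u * u for u in coded) / n - mean_u ** 2
    return origin + unit * mean_u, unit * math.sqrt(var_u)

# Hypothetical statures in inches, invented for illustration only.
statures = [64.5, 66.0, 67.5, 68.0, 69.5, 70.0, 71.5, 73.0]

direct = mean_sd_direct(statures)
coded = mean_sd_coded(statures, origin=68.0, unit=0.5)
agree = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(direct, coded))
print(direct, coded, agree)
```

Agreement between the two routes does not prove the data were copied correctly, but a gross arithmetical slip in either route would be most unlikely to repeat itself in the other.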

The omission of a more complete bibliography of those writings which 
have contributed materially to the development of the statistical con- 
cepts dealt with is not accidental. For the most part, reference to them 
would be of very doubtful assistance to the reader to whom this discus- 
sion is addressed. In an elementary textbook of this nature one may in 
large measure and with equal justification present established concepts 
without historical review in much the same spirit as is characteristic of 
the elementary treatise on algebra. 

The author is considerably indebted to others for such value as may 
pertain to this presentation. The memory of a brief association with 
Professor J. Arthur Harris is a constant impetus to aid the student of
science in his advance along the path of clear thinking and more precise 
analysis. And the questions and observations of those students, whether 
ingenious or naive, along with stimulating conferences with statistical 
colleagues and inspiring if somewhat difficult hours with the erudite 
writings of the great masters, have played their part to keep the path of 
teaching forever fresh. That others may have developed essentially the
same type of exposition in their lectures is to be expected; indeed, the 
author is familiar with one case of striking parallelism. He has not, 
however, consciously borrowed the ideas of another without acknowl- 
edgment. 

If the form of presentation proves as helpful to others as it appears to 
have been to the students with whom the author has worked, then the 
present undertaking of both author and publisher is justified. It is with a
deep sense of gratitude that the many hours of most helpful discussions
with Dr. Borghild Gunstad are acknowledged by the author. Professor
Lowell J. Reed, Dr. W. Edwards Deming, and Dr. Marian Wilder have
contributed in like manner. The painstaking work of Miss Marjorie 
Moore in preparation of the diagrams, the curves for two of which were 
kindly supplied by Dr. Deming, is acknowledged with a sense of con- 
siderable indebtedness. 

Alan E. Treloar 

University of Minnesota 
February 9, 1939 
ELEMENTS OF STATISTICAL REASONING


CHAPTER 1 

NUMERICAL DESCRIPTION 

There have been many definitions of science, each one more or less
colored by the particular interests of its promulgator and in some way 
failing to satisfy all. In a broad view, however, the essential objective 
of science is embodied in a brief statement which is quite familiar to those 
whose privilege it was to work with the pioneering American naturalist 
and biometrician, Professor J. Arthur Harris: 

“The task of science is the description of the universe.” 

In offering this expression for consideration, one is not concerned here 
with the all-inclusive scope of science which is suggested, but with the 
method of science. It is the task of science to describe. 

One may recognize immediately that the attainments of description 
motivated by scientific objectives vary widely; not all under its title 
necessarily warrants acceptance as science. Only description having the 
quality of undisputed acceptability, of winning universal assent to the 
truth of that set forth, achieves the ideal of scientific exposition. Acceptable
formulations of the “laws of science,” tracing fundamental relationships
within the universe, arise only from description characterized by
this quality.

SUBJECTIVE AND OBJECTIVE DESCRIPTION 

The advance from obscure perception toward clearly defined under- 
standing of the universe, physical and organic, has been made through 
increasing objectivity in meaningful description. Workers in some of the 
various divisions of science today have attained great skill in purely 
objective exposition, thereby approaching scientific ideals very closely. 
Others, though striving for the same ends, must still formulate their 
reasoning for the most part in a comparatively ill-defined language of 
purely verbal definitions, that is, in a highly subjective manner.

The varied degrees of accomplishment in objectivity of description 
characterizing the many fields of scientific inquiry may be considered a 
consequence of the operation of two primary and interlocking factors.
First, achievement in some sciences is fundamental to success in others; 
and, second, the complexity of phenomena in some fields must of neces- 
sity make satisfactory description very much more difficult than in 
others. In general, understanding of the elemental materials and forces 
of the universe and their relatively simpler interactions must pave the 
way to analysis of more complicated phenomena. In no small measure 
because of this the physical sciences have forged ahead over relatively 
smooth ground, whereas the basic biological sciences, and still further 
away the social sciences, fighting most complex entanglements of struc- 
tural materials and multitudinous interacting forces, have found that at 
best they may advance but slowly in objective description. 

Attention may be directed to the practical tasks involved in describ- 
ing some well-known biological characters, such as height and health of 
human individuals. One essays to describe such characters because 
people differ with regard to them, and one wishes to make comparisons 
among people with respect to these variables. In a discussion with pre- 
cision objectives it seems foolish at the present time to suggest describing 
human height merely by dividing all individuals into such ill-defined 
classes as are connoted by the words “tall” and “short.” Nevertheless,
some such move forms the first step in description. Subgroups are
recognized within the whole, the simplest division being into only two
classes. One might feel pleased to be able to describe, acceptably to
others, the health of a large array of individuals by segregation merely
into the two classes, “satisfactory” and “unsatisfactory.” But of course
one is in fact able to go much further, be more objective or “scientific,”
in describing height than in describing health. We may measure height
in absolute units. The difference between the two situations is fundamentally
one of degree of simplicity in the concepts of height and health,
with attendant difference in degree of mastery of the problem of attain- 
ing precision in the description. 

The evolutionary path of description for single characters is a passage 
from recognition of only two subdivisions in the range of variation, to 
recognition of more and more subdivisions giving 3, 4, 5, et cetera, more
homogeneous groups qualitatively designated by suitable titles. Con- 
sidering height so described, for instance, one may pause at 5 subdivisions 
for the moment and call them “very tall,” “tall,” “medium,” “short,” 
and “very short.” Essaying to describe height thus in 100 individuals, 
one might study each one singly and finally manage to assign all to what 
seem to be the appropriate categories. In the absence of any measurement
or of matching one against another, it would be a truly remarkable
feat if there proved to be no overlapping of the groups with respect to the 
statures embraced by them. Optical illusions might well cause some
gross misplacements. Of far-reaching consequence, however, is the fact 
that, in the absence of measurement or direct comparison, errors of 
classification at group boundaries by any one individual must in general 
be expected to increase as the number of border lines between classes 
increases, that is, as the grouping becomes finer. Let many individuals 
essay independently to make such a classification of a single set of subjects,
guided only by such descriptive terms as “very tall,” et cetera, and
the confusion when all results are assembled may be expected to become 
quite considerable. 
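
The growth of boundary errors with finer grouping may be illustrated by a small simulation. The sketch below, in Python, rests wholly on assumptions made for illustration: statures spread evenly between 60 and 76 inches, equal-width classes, and a judge whose estimate by eye errs with a standard deviation of one inch. None of these figures comes from the text. It counts how often the judged class differs from the true class as the number of classes is increased.

```python
import random

def misclassification_rate(n_classes, n_subjects=20000, noise_sd=1.0, seed=1):
    """Share of subjects that a noisy judge places in the wrong height class."""
    rng = random.Random(seed)
    low, high = 60.0, 76.0                  # assumed range of stature, inches
    width = (high - low) / n_classes        # equal-width classes

    def class_of(height):
        index = int((height - low) / width)
        return min(max(index, 0), n_classes - 1)   # clamp into the end classes

    wrong = 0
    for _ in range(n_subjects):
        true_height = rng.uniform(low, high)
        judged_height = true_height + rng.gauss(0.0, noise_sd)  # judgment by eye
        if class_of(judged_height) != class_of(true_height):
            wrong += 1
    return wrong / n_subjects

# Compare coarse and fine groupings under the same assumed judging error.
for k in (2, 3, 5, 10):
    print(k, round(misclassification_rate(k), 3))
```

Under these assumptions each added border line gives the same judging error a further opportunity to carry a subject into the wrong class, so the proportion misplaced rises as the grouping becomes finer.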

There is no difficulty in imagining such a situation provided that the 
descriptive terms used are open only to subjective interpretation. Shift 
the variable under discussion from height to health, and one immediately 
faces a much more complex situation. Stature is an elementary variable 
portrayed by a single dimension, whereas health represents a composite 
of many unitary and interacting variables which for the most part are but 
dimly understood. What seems good health to a doctor is often believed 
to be poor health by a patient, and vice versa. That which comprises good 
health at one age may quite properly be regarded as poor health at 
another age. It is virtually impossible to reach general concordance of 
judgment in subdividing variation in the state of health into even very 
few subgroups. The use of descriptive terms open to varying subjective 
interpretations introduces of necessity an element of confusion into 
classification, whatever may be the character under study. 

THE LANGUAGE OF NUMBER 

Universal assent, or even a satisfactory approach to it, cannot be 
obtained without standardization of the terms in which description is 
made. The meanings of a large proportion of words and phrases used in 
qualitative description are susceptible of only temporary definition which 
at best may prevail rigidly within rather limited geographic bounds. 
Anti-vaccinationists today correctly quote Jenner himself as stating that 
smallpox vaccination may be accompanied by the development of 
“tumours” in the armpits! But the implication of the word “tumour” is 
quite different today from what it was in Jenner’s time. Only in a relatively
loose sense may subjectively interpretable words serve as a vehicle
of scientific description. Units of dimension, on the other hand, have 
proved capable of universal and permanent standardization. They also 
have the property of unlimited divisibility without any loss of meaning, a 
quality quite foreign to words. 

Description in terms of dimension allows one to pass directly from the 
multitudinous verbal languages, with all their elements of subjective
definition, to the single universal language of number, a language of unlimited
scope and yet of absolutely fixed meaning. It is only natural that
in their evolution all sciences have moved toward the greater and greater
use of measurement and number as the vehicle of precise and objective
definition. On this matter Quetelet¹ has remarked:

The more advanced the sciences have become, the more they
have tended to enter the domain of mathematics, which is a sort of
center towards which they converge. We can judge of the perfection
to which a science has come by the facility, more or less great, with
which it may be approached by calculation.

Number may arise in description with or without the use of measurement.
Fundamental measurement may be defined as the counting of
standard units and fractions thereof to fix a dimension. But entities or
occurrences such as children per family, deaths among cases of typhoid
fever, et cetera, which are perfectly definite in themselves, although indivisible,
may be counted as precisely as units of measurement. Thus we
have two clear-cut systems of numerical description. In one, the unit is
fixed and divisible; in the other, it is definitely recognizable as an individual
or an attribute of an individual, but it is indivisible. Both
systems, however, provide a secure quantitative foundation for scientific
inference.

VARIATION 

The incentive to describe rests on the recognition of differences and 
the desire to distinguish clearly between variants. The taxonomist 
readily recognizes certain differences between two groups of plants or 
animals and uses those differences as a basis for the designation of two 
species. Then finer description shows that no two individuals in the one 
species are precisely alike. Is every plant or animal to be differentiated 
from every other one in taxonomy, or will it be adequate to recognize the 
existence and define the scope of variation within species? 

The physical chemist, recognizing that elements differ, essays to find 
the atomic weight of each. In repeated determinations on any one, he
fails to secure completely concordant results. Are elements to be considered
merely as classes of materials like biological species, within each
of which variation may occur? 

These two types of experience are found to be perfectly general 
throughout all the sciences. Variation seems universal; obvious and 
inescapable in some cases, it is quite illusive in others. With the attainment
of more precise levels in description through counting and measurement,
and through the constant improvement in sensitivity of measuring
devices, it is not surprising to find that qualities which may at first have
seemed identical appear later to differ among themselves.

¹ Helen M. Walker. Studies in the history of statistical method. Baltimore:
Williams and Wilkins. 1929. Vide page 39.

The two specific examples given above of variation within types serve 
to distinguish two general sources of variation. On the one hand, things 
may differ in fact; and, on the other, repeated measurements of a thing
believed constant may lack perfect consistency. To designate these two
classes of variation as “authentic” and “illusory” is not very practical.
Every measurement is liable to error, and therefore “authentic” variation
is never seen uncontaminated by the “illusory.” In this latter connection
it has pointedly been written:²

No measurement of any real thing can ever be correct, for the
simple reason that no instrument is capable of infinitely small displacements
and no human eye can detect infinitesimal separations.
Errors are therefore inevitable. 

One may be content to recognize that variation in repeated measure- 
ments may be ascribed to two different sets of causes:

(1) the inherent and environmental influences in response to
which the character in question has attained its magnitude; and 

(2) the errors made by the observer and his measuring instru- 
ments collectively in attempting to determine the magnitude of the 
character. 

Variation of one or both types must always be expected to underlie all 
observations. It is the objective of the statistical method to analyze such 
variation, and in view of that analysis to provide dependable bases for 
the formulation of certain types of inference. 
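
The mingling of the two sources of variation may be pictured by a short simulation. The following Python sketch is offered only as an illustration under stated assumptions: the two standard deviations are invented, and the errors of observation are taken to be independent of the true magnitudes. On those assumptions the variance of the recorded measures is close to the sum of the variance arising from real differences and the variance arising from errors of observation.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical magnitudes: both standard deviations are invented for illustration.
TRUE_SD = 2.5    # real differences among individuals, source (1)
ERROR_SD = 0.5   # errors of observer and instrument, source (2)

# True magnitudes, and the same magnitudes as recorded with independent error.
true_values = [rng.gauss(68.0, TRUE_SD) for _ in range(20000)]
observed = [t + rng.gauss(0.0, ERROR_SD) for t in true_values]

var_true = statistics.pvariance(true_values)
var_error = statistics.pvariance([o - t for o, t in zip(observed, true_values)])
var_observed = statistics.pvariance(observed)

print(round(var_true, 3), round(var_error, 3), round(var_observed, 3))
```

In practice only the recorded values are available, which is why the two components cannot be seen apart without special study of the errors of observation.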

Free communication of scientific thought and descriptive technique 
permits each worker in modern times, whatever his field may be, to 
profit quickly from the experience of other scientists. Precision methods 
of objective description developed in some one field now quickly influence 
accomplishment in others. As a result the demand for precise measure- 
ment, long established in the physical sciences, has grown apace in bio- 
logical research. The quantitative data secured through counting or 
measurement in biology have, however, been frankly characterized by 
great variation in contrast to the relatively small variations observed in 
the “precise sciences.” The data have called for new methods of analysis 
resting upon the secure foundation of mathematics, the logic of the
quantitative. These methods have been developed very largely by
mathematicians attracted by the opportunities on biological frontiers.
Designated as biometric or statistical methods of analysis, they have in
recent years pervaded sciences wherein, before, only the most farsighted
dreamed their presence could be tolerated, let alone be profitable.

² R. W. M. Gibbs. The adjustment of errors in practical science. New York,
Oxford University Press. 1929.

PRECISION OF MEASUREMENT 

The accuracy of mathematical deductions from data must inevitably 
be limited in some way by the precision and adequacy of the observations 
themselves. Emphasis must always be laid, therefore, on the necessity 
for the exercise of critical judgment in the collection and recording of 
data. Precision in measurement and care in classification materially 
enhance the value of observations. 

It must not be inferred, however, that it is impossible to overdo these 
things. Overminuteness is mischievous. To measure adult stature to 
one-hundredth of an inch without determining the accuracy of the instru- 
ment used or the reliability of the observer's judgment, or without record- 
ing the physical state of the subject, is obviously foolish. Quite analogous 
overminute measurement in other connections is, however, not at all uncommon.
False confidence in the validity of individual measures characterizes
all too many pieces of work. It is effort well spent to determine
accurately the trustworthiness of measurements and classifications as a 
special project before extensive recording is undertaken. Statistical 
experience in the analysis of errors of observation is an excellent teacher 
concerning misplaced confidence in measurement. 

Ample room exists for more extensive exercise of critical judgment in 
determining the degree of refinement which is worth while in describing 
phenomena. Limits exist beyond which precision of measurement at the 
expense of number of observations is far too costly. One hundred mea- 
surements made to two significant figures may well be of much greater 
value to scientific analysis of a phenomenon than only ten measurements 
made to three or four significant figures. When the error of measurement 
is relatively small in comparison with the total range of variation of the
measures, that error may well be quite unimportant. For the most part 
at least, it is far more consequential to define a variation system well by 
having many determinations, than it is to have the individual determin- 
ations made with great refinement in accuracy. That which is important 
to attainment of a broad scientific objective is too easily veiled by the 
personal satisfaction which great refinements in technique of execution of 
measurements may give. 
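
A rough calculation may make the comparison concrete. The Python sketch below is an illustration under assumptions that are not in the text: adult stature is given a hypothetical standard deviation of 2.5 inches, and the error introduced by recording to a given unit is treated as independent and spread uniformly over that unit, which adds one-twelfth of the squared unit to the variance. The standard error of the mean is then compared for one hundred statures recorded to the nearest inch and for ten recorded to the nearest hundredth of an inch.

```python
import math

def standard_error_of_mean(sd_variation, recording_unit, n):
    """Standard error of a mean when each value is rounded to recording_unit.

    Assumption: rounding behaves like an independent error spread uniformly
    over one recording unit, adding recording_unit**2 / 12 to the variance.
    """
    variance = sd_variation ** 2 + recording_unit ** 2 / 12.0
    return math.sqrt(variance / n)

SD_STATURE = 2.5  # hypothetical standard deviation of adult stature, inches

many_coarse = standard_error_of_mean(SD_STATURE, recording_unit=1.0, n=100)
few_fine = standard_error_of_mean(SD_STATURE, recording_unit=0.01, n=10)
print(round(many_coarse, 3), round(few_fine, 3))
```

On these assumptions the mean of the hundred coarse measurements is fixed to within a standard error of about 0.25 inch, against about 0.79 inch for the ten refined ones. The refinement of the individual readings contributes almost nothing, because the rounding error is small beside the variation among individuals.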
THE DESCRIPTIVE VALUE OF MEASUREMENTS

As the complexity of phenomena increases, so the difficulty of deter- 
mining measures which adequately portray each phenomenon increases. 
The task of determining the amount of iodine in a tissue is relatively 
simple when compared to that of measuring the intelligence of man. The 
latter variable is only in a limited way susceptible of quantitative evalu- 
ation with existing devices. Any measurements secured reflect only 
certain features of intelligence. One all too commonly encounters mea- 
surements which by terminology purport to describe rather fully the 
phenomenon to which they are applied when in fact the coverage is 
rather meager. One may venture the opinion that one of the most pro- 
nounced weaknesses in scientific inference, past and present, is over- 
generalization by the would-be servants of science. Careful scrutiny of 
all measures to determine their limitations as descriptions of the phe- 
nomena being investigated is greatly needed among workers in the broad 
fields of biological science, particularly those engaged with problems in 
the social, psychological, and physiological behavior of human beings and 
their less fortunate proxies in experimental investigation. 

THE SAMPLE AND THE SUPPLY 

In gathering sets of measurements of a specific type, the scientist is 
motivated by a desire to learn something about the character measured, 
not alone in the individuals measured but in such individuals generally in 
the universe. Assuming that the measurement is appropriate to the
character in question, and that it has been made with adequate accuracy, 
he has before him a description of the character in the individuals he has 
measured. Let it be assumed that the measurement is total milk yield 
from each of a number of cows of a specified breed during a fixed period 
following parturition. His interest in any use of those measurements 
does not pertain fundamentally to the individual animals tested, but to a 
much larger universe of measurements of which he believes his group to 
be representative. The particular group is just a sample from some 
indefinitely large population, universe, or supply of such measures, all of
which he would prefer to have and use if there was no serious obstacle in
the way of his doing so.³

Economy with respect to time if not finance must be expected to 
limit rather sharply the number of measurements that may be secured in 
the analysis of any problem. Accordingly, a sample is selected to form a
basis for broader interpretation. From the data comprising the sample
the scientist wishes to draw conclusions concerning the supply of which
that sample is representative. He wishes to make sound inferences with
respect to a general body of data from reasonably precise knowledge of a
relatively few individual measurements, a particular set. This is inductive
reasoning, the inverse of deductive reasoning which passes from the general
to the particular. New knowledge of the universe is always acquired
through inductive inference from samples drawn from the universe, and
statistical analysis forms a path along which such scientific inference may
proceed with security.

³ The technical problem of the computational difficulty involved when enormous
numbers of measurements must be handled is not pertinent here.

As a basis for such inductive processes of thought, the samples must 
be representative of the supply in some specified manner. Accordance of 
the sample with such specifications is essential to the accuracy of the 
inductive process. The only generally adaptable specification of repre- 
sentativeness of a sample is that the individual measures comprising it 
should form perfectly random selections from the universe of reference. 
By perfectly random selection of a sample from a supply, one means, 
strictly speaking, that every individual in the supply has had equal 
opportunity of being drawn as a constituent member of the sample. All 
differences between samples thus drawn at random from the same supply 
are attributable solely to errors of random sampling. One of the prime 
functions of statistical methods is to determine the probability of any 
given difference between samples occurring because of such errors of 
sampling. 
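
The notion of errors of random sampling may be illustrated by drawing repeated pairs of samples from one and the same supply. The Python sketch below is an assumption-laden illustration only: the supply of 10,000 measures, the sample size of 25, and the thresholds are all hypothetical. Each sample is drawn so that every individual in the supply has equal opportunity of selection, and the proportion of pairs whose means differ by at least a stated amount estimates the probability of such a difference arising from sampling alone.

```python
import random
import statistics

rng = random.Random(11)

# A hypothetical finite supply of 10,000 measures, invented for illustration.
supply = [rng.gauss(100.0, 15.0) for _ in range(10000)]

def random_sample(size):
    """Draw a sample in which every individual has equal opportunity."""
    return rng.sample(supply, size)

def prob_difference_at_least(threshold, size=25, trials=4000):
    """Share of sample pairs whose means differ by at least the threshold."""
    hits = 0
    for _ in range(trials):
        mean_a = statistics.fmean(random_sample(size))
        mean_b = statistics.fmean(random_sample(size))
        if abs(mean_a - mean_b) >= threshold:
            hits += 1
    return hits / trials

# Small differences arise often from sampling alone; large ones rarely.
for threshold in (2.0, 5.0, 10.0):
    print(threshold, prob_difference_at_least(threshold))
```

Since both samples of every pair come from the same supply, every difference counted here is a pure error of random sampling. An observed difference that such errors would seldom produce is the kind for which some other explanation may reasonably be sought.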

Random representativeness of samples is a prerequisite to the formal 
procedures of statistical analysis. If it should prove possible to select a 
sample that would be exactly alike in all characters to the universe of 
measurements from which it is drawn, then the statistical problems of 
induction would vanish and solutions concerning the populations could 
be given with certainty. Such a type of selective sampling from this 
universe is wholly impossible. It would require precisely that knowledge 
concerning the supply of measures which is unknown, determination of 
which forms the objective of the investigation.

In most practical research, the sample arises through certain circum- 
stances rather than by any technique of random selection from a pre- 
viously defined supply. In such cases, the supply which it may be 
considered to represent as a random sample must be defined by the 
investigator. This definition naturally arises from subjective judgment 
which is very likely to be colored by “wishful thinking.” Limitations of
the sample have not infrequently been overlooked in published investigations,
and accordingly some conclusions have been drawn which are
obviously spurious. Great care should be exercised to make sure that 
statistical conclusions are drawn with reference only to the population of
which the sample is truly representative in the sense just defined. The 
statistical method requires intelligent application to yield dependable 
results. 


SYMBOLIC REPRESENTATION OF DATA 

An invaluable contribution of mathematics to the sciences is its pro- 
vision of a highly developed system of symbolic reasoning. The economy 
of thought and clarification of ideas which it makes possible constitute 
ample justification for its extensive use in the analytical reasoning of 
biology. In his introductory editorial to Biometrika, Karl Pearson wrote:

. . . symbolic analysis widens our notions, it leads us at once to new
points of view and it directly suggests fresh points for observation and 
novel directions for experimental research. 

The truth of this statement is amply substantiated by the experience 
which all may readily acquire who blaze trails on the frontier of quantita- 
tive biology. Its power lies in simplification of statement, the portal to 
clear thinking. 

Just what character is employed as a symbol for any entity is, of 
course, entirely optional. For convenience one usually chooses familiar 
symbols, such as the letters of alphabets and signs well enough known by 
usage to be generally known by name. There is not any necessity to 
retain a selected symbol for the same object of reference outside of the 
particular piece of analysis being conducted. Thus one may use the 
symbol x to represent the red blood cell count of healthy women students, 
or the protein content of samples of flour, or the calculated distance be- 
tween two stars, or the number of days taken by a crop to reach the 
flowering stage, and so on. x may designate any character to which one 
wishes to refer repeatedly. Diversified interest will probably necessitate 
the use of several symbols in any one investigation, so that u, v, x, y, et 
cetera, may each represent a different entity. In such cases it is often 
helpful to employ symbols that are automatically suggestive in some way 
of the object designated. 

In the following pages the symbols x, y, and so forth, designating 
variable quantities, are to be considered in a very general sense from the 
standpoint of research. As far as is convenient an effort will be made in 
general symbolism to follow the common mathematical practice of using 

to designate variables. On occasions this convention proves hampering, 
and it may of course be dispensed with at will in favor of some more 
suitable alternative. 
Let x, then, designate a chosen quantity which changes in magnitude 
from one individual to another, or from one determination to another on 
the same materials. It designates the variable character rather than any 
specific value of it. Such characters will be called variables herein. It is 
necessary to consider more than one value of x, “more than one cherry of 
the crop.” Indeed, it is usually desirable to consider as many values of 
x as it is economically feasible to secure. The sequence of symbols, 

$x_1, x_2, x_3, \ldots, x_N,$ 

may then represent in abbreviated form the series of N individual mag- 
nitudes comprising the sample. We shall herein call these individual val- 
ues of x the variates. It is to be noted particularly that the order of the 
subscripts does not necessarily imply order of magnitude. $x_1$ is just one 
of the N values of x forming the sample, $x_2$ is another magnitude, and 
so on. The series merely indicates that there are N variates in this 
sample of the variable x. 

It may not be amiss to emphasize again just what the purpose is in 
measuring these N individuals. The scientific objective would be to find 
out what one can about them, not for their own sake but as a basis of 
generalization concerning the character x. These generalizations are to 
constitute a step in scientific description of the universe. To the owner 
of N cows the individual yields are a matter of considerable importance. 
He is interested in them as such for economic reasons. The scientist 
would be interested in the yields merely as an example of what such cattle 
in general may be expected to produce. Suppose these cows to be Jersey 
cows, and $y_1, y_2, y_3, \ldots, y_N$ to be milk yields for a herd of Guernsey cows. 
Is the milk yield of the Jersey breed higher than that of the Guernsey 
breed? In looking at the matter from the standpoint of science, one 
wishes to base a generalization on the observed values of x and y. One 
wishes to determine which is the superior breed in the characteristic of 
milk yield. The two series of records are but samples from two popula- 
tions that are known to differ in many characters. It is desired to de- 
termine if possible whether that differentiation extends to the character 
in question. 

SIZE OF SAMPLE 

The dependability of any generalization which is made from impartial 
study of a sample drawn at random from a universe of discourse in which 
all magnitudes are not identical clearly must decrease as the number of 
individual measurements in the sample decreases, all other factors being 
equal. Paucity of information given by the basic data cannot be ad- 
justed by mathematical corrections in the analysis of those data. The 
trustworthiness of any generalization is a direct function of the number 
of cases upon which it is based. The desirability of avoiding the use of 
small samples when dealing with wide variation, except where it is not 
feasible to secure larger numbers of observations, should not be over- 
looked, for the hazards of interpretation intensify rapidly as N becomes 
very small. Detailed discussion of these hazards will be left for a later 
volume; the immediate concern is to stress the fundamental nature of the 
reasoning given in the first sentences of this paragraph. The student will 
do well to consider the point very carefully. 
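
The point may be made concrete by a small sampling experiment. The sketch below is an illustration only, not part of the original exposition: the universe (normally distributed values with mean 100 and standard deviation 15), the function name spread_of_sample_means, and the use of the arithmetic mean as the generalization drawn from each sample are all assumptions chosen for the example.

```python
import random
import statistics

def spread_of_sample_means(n, trials=1000, seed=1):
    """Draw `trials` random samples of size n from a hypothetical universe
    and return the standard deviation of the resulting sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.gauss(100.0, 15.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

# The smaller the sample, the more widely its mean wanders from trial to trial.
for n in (4, 25, 100):
    print(n, round(spread_of_sample_means(n), 2))
```

Under these assumptions the spread of the means shrinks steadily as N grows, which is the sense in which a generalization from a larger sample is the more dependable.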



CHAPTER 2 


THE LAW OF FREQUENCY DISTRIBUTION 

Chaos is a state of lack of arrangement; that which appears to be 
chaotic may well be readily susceptible of orderly organization. Varia- 
tion among physical and biological magnitudes appeared to scientists for 
a long time to be chaotic until study of methods of systematization of 
data showed that the quality of orderliness may be extended to the 
phenomenon of variation in all its forms. This revelation opened tre- 
mendously fertile fields for the expansion of quantitative description 
and the increase in precision of inference in the biological sciences. 

Orderliness in variation appears to have been first appreciated by 
the mathematical astronomer as a characteristic of data accumulated 
in the precise sciences. The gradation of errors made in the determina- 
tion of spatial relationships of the heavenly bodies was observed by them 
to have a characteristic pattern coinciding with that of a mathematical 
function. Developed by the French mathematician, Laplace (1749- 
1827), and first applied to astronomical errors by the German mathema- 
tician and astronomer, Carl Friedrich Gauss (1777-1855), this function 
has become known variously as the “law of error,” the “Laplace-Gaus- 
sian curve,” or more recently as the “normal curve.” It may well be 
called the principal cornerstone of the modern statistical edifice. Credit 
for discerning and primarily developing the application of this function 
to biological variations belongs to Adolphe Quetelet (1796-1874), the 
Belgian scientist distinguished for wide accomplishment growing from 
his tremendous breadth of interest. His dynamic sponsorship in Europe 
of this new “statistical” method of inquiry into biological and social 
phenomena was brilliantly followed in England by Francis Galton 
(1822-1911), whose enthusiasm was unbounded as he verified the prin- 
ciple of systematic variation in every biological variable for which he was 
able to accumulate adequate data. 

Revelation of this principle of orderliness in biological variation 
formed the beginning of a new era in biological research, Galton writing 
with characteristic charm of the discovery in his epoch-making book 
“Natural Inheritance” (1889) : 
I know of scarcely anything so apt to impress the imagination as 
the wonderful form of cosmic order expressed by the “Law of Fre- 
quency of Error.” The law would have been personified by the 
Greeks and deified, if they had known of it. It reigns with serenity 
and in complete self-effacement amidst the wildest confusion. The 
huger the mob, and the greater the apparent anarchy, the more 
perfect is its sway. . . . Whenever a large sample of chaotic elements 
are taken in hand and marshalled in the order of their magnitudes, 
an unsuspected and most beautiful form of regularity proves to have 
been latent all along. 


FIGURE 1 

Frequency Distribution of 500 Measurements of the 
Width of a Spectral Band of Light * 



Measured width of light band in thousandths of a millimeter 


The “law of frequency of error” to which Galton referred, pertains to 
the change in frequency of occurrence of the successive magnitudes of a 
variable when they are placed in order. As one passes along the scale of 
measurement of a variable, from the smallest magnitude to the largest 
observed in any sufficiently extensive series, one finds an orderly change 
in the frequency with which each successive magnitude occurs. Most 
commonly the frequency of occurrence of each magnitude increases 
progressively with advance along the scale until a maximum is reached 
somewhere in the central region of the range and then an orderly falling 
away in frequency is observed until it finally disappears. A typical 
example of this characteristic in errors of measurement is provided in 
Fig. 1, wherein the frequency distribution of 500 determinations of the 
width of a spectral band of light is graphically presented. Among 
biological variables, anthropometric measures tend to follow fairly 
closely this symmetrical “law of error” function used first by Gauss. 


* Data by courtesy of Professor Raymond T. Birge, University of California. 
This is illustrated in Figs. 2a and 2b, where the same graduating curve 
used in Fig. 1 is applied to finger-length records secured from the Scot- 
land Yard criminal identification files, and to scores made by a class of 
449 students on an examination in “current affairs.” In these three 
diagrams of frequency distribution the heights of the rectangles corre- 
spond to the frequencies observed within the designated ranges on the 

FIGURE 2a 

Frequency Distribution for Length of Left Middle Finger * 



FIGURE 2b 

Frequency Distribution for Marks Secured on a “Current Affairs” 
Examination by 449 Students † 



scales of measurement given along the base. The curve in each case 
represents a mathematical smoothing of the polygon of rectangles. It 
portrays the statistician’s estimate of the frequency distribution in the 
assumedly infinitely large supply of such measures which conceivably 
might have been made in each case, and from which the samples may 
be considered as being assembled by random selection of individual 
measures. 

* For basic data, see Table 3, page 19. 

† Data by courtesy of the Committee on Educational Research, University of 
Minnesota. 
Galton erred if he regarded the “law of frequency of error” as being 
descriptive of biological variations in general. Not all variables show 
this particular curve's specific characteristics in their frequency grada- 
tion. The type of mathematical function may change freely from one 
variable to another. However, all do show orderliness of frequency dis- 
tribution in some related form. The law of frequency distribution, 
namely, that the frequency is a smooth mathematical function of the 
magnitude of a variable, seems truly universal in its sway. Let the 
variates be “marshalled in order” through seriation, and the form will 
unfold. 

CONTINUOUS AND DISCRETE VARIABLES 

The scales in terms of which the magnitudes of variables are numer- 
ically expressed are of two distinct general types, and biological varia- 
tion may accordingly be classified into two categories on this basis. The 
egg production of the domestic fowl may be appropriately described in 
terms of the total weight of eggs laid, as well as in terms of the number 
of those eggs. The former is a continuous scale of measurement, theo- 
retically capable of division into infinitesimal increments. The scale of 
weight proceeds continuously from any one specified unit to the next; 
there are no gaps or jumps. Any variable described in terms of such a 
scale may accordingly be designated as continuous in its variation. 

The other scale is that of the integral numbers. It is not continuous 
but proceeds by discrete increments of one integer, or complete units. 
It is the basic scale of counting with which fractions are incompatible. 
Variables whose magnitudes are expressed on this scale of pure numbers 
are designated as discrete or discontinuous. 

The general forms of distribution of frequency for both types of 
variable follow the same orderly patterns; the law of frequency distribu- 
tion is independent of these scale characteristics. This will appear in 
that which follows. 


SERIATION 

The process of gathering together like magnitudes of a variable to 
form a frequency tabulation is known as seriation, grouping, or classi- 
fication. It is a procedure of most simple character for any discrete 
variable, the scale of integers automatically defining for them a seriation 
scheme. Thus, using data quoted by Maynard,* the variation in num- 
ber of children in completed families in New Zealand in 1916 is system- 
atically portrayed by the adjacent seriation (Table 1) resulting from 
analysis of the original census data. Each possible value for number of 

* G. D. Maynard. A study in human fertility. Biometrika, 14: 337-354. 1923 
children per fertile marriage from one up to the maximum recorded (19) 
is recognized. Fractional values being impossible, this seriation involves 
no approximation; it is precisely definitive of all observed values. 

TABLE 1 

Frequency Distribution for Number of Children 
in Completed Families.* New Zealand, 1916 

(Data from Maynard) 

Number of        Number of 
children         families 
    x                f 

    1               175 
    2               258 
    3               393 
    4               421 
    5               465 
    6               462 
    7               413 
    8               428 
    9               381 
   10               336 
   11               237 
   12               155 
   13               108 
   14                70 
   15                25 
   16                14 
   17                 3 
   18                 4 
   19                 3 

 Total            4,351 
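
For a discrete variable the seriation amounts to counting how often each integer occurs. The following is a minimal sketch of that counting; the twelve family sizes are made-up values used only for illustration, not records from Maynard's data, and the variable names are likewise hypothetical.

```python
from collections import Counter

# Hypothetical raw records: number of children in each of 12 families.
children_per_family = [3, 5, 2, 5, 6, 3, 5, 1, 4, 6, 5, 3]

# Seriation: gather like magnitudes together and count each class.
frequency = Counter(children_per_family)

print("x   f")
for x in sorted(frequency):
    print(f"{x:<3} {frequency[x]}")
print("N =", sum(frequency.values()))
```

Because the scale of integers defines the classes, no decision about class limits is required; every observed value is recorded exactly.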


In the case of any continuous variable, on the other hand, some unit 
of classification must be determined in advance and seriation made 
according to the arbitrary grouping so established. It is necessary to 
recognize clearly that the recorded magnitude of a variate on a continu- 
ous scale is only an approximation to its true value. The degree of such 
approximation is determined by the precision attained with the instru- 
ments employed in the measurement process. The recorded values of 


* These data refer to fertile unions only, for mothers married at 20 years of age who, at the time 
of census, were between 45 and 65 years of age. The families may reasonably be considered to be 
completed, any corrections being of triSing conseQuence to the form of the distribution. 
the variable appear to be discontinuous or proceed in “jumps,” those 
jumps being equal to the smallest increment or graduation recognized in 
making the measurement. This increment therefore determines the 
smallest possible unit of classification of a series of variates on a continu- 
ous scale. It automatically imposes a classification on the truly con- 
tinuous scale. 

We may consider the head breadth measurements made on 1,000 
Cambridge University students* as illustrative material. These meas- 
urements were made to the nearest tenth of an inch, and ranged from 

TABLE 2 

Frequency Distribution for Head Breadth in Cambridge Men 


(Data from Macdonell) 

Head breadth 
in inches        Frequency 
    x                f 

   5.5                3 
   5.6               12 
   5.7               43 
   5.8               80 
   5.9              131 
   6.0              236 
   6.1              185 
   6.2              142 
   6.3               99 
   6.4               37 
   6.5               15 
   6.6               12 
   6.7                3 
   6.8                2 

 Total            1,000 


a minimum of 5.5 inches to a maximum of 6.8 inches. The seria- 
tion imposed by the increment of measurement is given in Table 2. 
Since measurement was made to the nearest tenth of an inch, the 236 
cases recorded as 6.0 inches, for instance, really fall over the range 5.95 
to 6.05 inches. No attempt was made to differentiate them, all being 
recorded as the same magnitude. 

More often than not, measuring instruments commonly used in the 
biological sciences are graduated and read to very small increments in 

* W. R. Macdonell. On criminal anthropometry. Biometrika, 1 : 177-227. 1902. 
comparison with the total range of variation encountered. Under such 
conditions the increment of measurement provides a seriation scheme, 
with a very large number, probably an altogether unwieldy number, of 
classes in which frequency is to be recorded. It then becomes advan- 
tageous to establish classes of broader range for seriation purposes, 
classes that have a range of some simple and preferably constant mul- 
tiple of the increment of measurement. Selection of that multiple will 
be quite arbitrary, and may be made according to the purposes which 
the seriation is to serve. A case involving a small multiple may be used 
by way of illustration. 

In the study just referred to, Macdonell* also gives data abstracted 
from the records of the Central Metric Office, New Scotland Yard, for 
length of left middle finger (measured to the nearest millimeter) in 3,000 
prisoners. Seriations of these data are given in Table 3 in two classi- 
fications: (1) according to that imposed by the increment of measure- 
ment, and (2) in broader classes of three such intervals. Although the 
recorded measurements proceed by discrete steps of 1 mm., finger length 
is of course a continuous variable. A recorded value of 95 mm., when 
properly interpreted, means merely that the true value lies within the 
range of 94.5 to 95.5 mm. If the measuring instrument had been grad- 
uated only in increments of 3 mm., starting, say, at 92 mm., then the 
records would have followed the seriation given on the right-hand side 
of the table. Thus, grouping of data into broader classes than those 
imposed by the increment of measurement is simply the equivalent of 
having used a coarser scale of graduations in measurement. 
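
The regrouping described above can be sketched in a few lines. The frequencies are those of Table 3; the function name regroup and its parameters are hypothetical names introduced for this illustration, and measurement to the nearest graduation is assumed throughout.

```python
# Frequencies for recorded finger lengths 94, 95, ..., 135 mm (Table 3).
first_value = 94
freq_1mm = [0, 1, 0, 0, 1, 3, 7, 7, 10, 17, 20, 30, 44, 74, 75,
            102, 163, 152, 183, 164, 228, 233, 226, 232, 184, 162, 163,
            126, 91, 89, 44, 52, 35, 31, 25, 7, 8, 2, 6, 2, 0, 1]

def regroup(freqs, first, width, increment=1.0):
    """Combine `width` adjacent unit classes into one broader class.

    Returns (true_lower, true_upper, class_center, frequency) tuples,
    assuming values were recorded to the nearest graduation.
    """
    classes = []
    for i in range(0, len(freqs), width):
        lowest = first + i * increment              # lowest recorded value in class
        highest = lowest + (width - 1) * increment  # highest recorded value in class
        lower = lowest - increment / 2
        upper = highest + increment / 2
        classes.append((lower, upper, (lower + upper) / 2,
                        sum(freqs[i:i + width])))
    return classes

classes = regroup(freq_1mm, first_value, 3)
for lower, upper, center, f in classes:
    print(f"{lower:5.1f}-{upper:5.1f}  {center:5.0f}  {f:4d}")
```

The fourteen classes so formed reproduce the right-hand side of Table 3, and their frequencies still sum to 3,000, since regrouping merely pools adjacent counts.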

It will be clear that measurement of any continuous variable involves 
errors of approximation, and seriation of such measurements into classes 
of broader range will have the effect of magnifying those errors. Their 
influence, however, may be reduced to an essentially negligible level by 
appropriate correction when the refinement of the investigation war- 
rants it. These “corrections for grouping” are rarely of importance in 
biological research, particularly if there be, say 10 or more classes in the 
final seriation scheme. 

Seriation plays a very important role of initial simplification in the 
statistical reduction of data. It permits of the replacement of N 
individual measurements by a very much smaller number of classes of 
measurements. In the absence of a calculating machine, seriation may 
be a decided help in computations for all but very small series of values. 
The number of classes which is desirable is a matter for individual judg- 
ment based on the nature of the problem in hand. It might be stated as 
a rough guide that from 8 to 15 classes are suitable for most practical 
purposes, with preferably an average frequency per class of 10 or more. 
TABLE 3 

Frequency Distribution for Length of Left Middle Finger in Criminals 

(Data from Macdonell*) 

Recorded                                                 Class 
 value       True range    Frequency    Class range      center    Frequency 
 in mm.                                                    x           f 

   94        93.5- 94.5         0 
   95        94.5- 95.5         1        93.5- 96.5        95           1 
   96        95.5- 96.5         0 

   97        96.5- 97.5         0 
   98        97.5- 98.5         1        96.5- 99.5        98           4 
   99        98.5- 99.5         3 

  100        99.5-100.5         7 
  101       100.5-101.5         7        99.5-102.5       101          24 
  102       101.5-102.5        10 

  103       102.5-103.5        17 
  104       103.5-104.5        20       102.5-105.5       104          67 
  105       104.5-105.5        30 

  106       105.5-106.5        44 
  107       106.5-107.5        74       105.5-108.5       107         193 
  108       107.5-108.5        75 

  109       108.5-109.5       102 
  110       109.5-110.5       163       108.5-111.5       110         417 
  111       110.5-111.5       152 

  112       111.5-112.5       183 
  113       112.5-113.5       164       111.5-114.5       113         575 
  114       113.5-114.5       228 

  115       114.5-115.5       233 
  116       115.5-116.5       226       114.5-117.5       116         691 
  117       116.5-117.5       232 

  118       117.5-118.5       184 
  119       118.5-119.5       162       117.5-120.5       119         509 
  120       119.5-120.5       163 

  121       120.5-121.5       126 
  122       121.5-122.5        91       120.5-123.5       122         306 
  123       122.5-123.5        89 

  124       123.5-124.5        44 
  125       124.5-125.5        52       123.5-126.5       125         131 
  126       125.5-126.5        35 

  127       126.5-127.5        31 
  128       127.5-128.5        25       126.5-129.5       128          63 
  129       128.5-129.5         7 

  130       129.5-130.5         8 
  131       130.5-131.5         2       129.5-132.5       131          16 
  132       131.5-132.5         6 

  133       132.5-133.5         2 
  134       133.5-134.5         0       132.5-135.5       134           3 
  135       134.5-135.5         1 

 Total                      3,000                                    3,000 
As will be shown shortly, seriation with broad grouping is often necessary 
for appropriate graphical representation of frequency distributions. 
When a calculating machine is available, it may not be advantageous 
to bother about seriation for purely computational purposes unless 
the number of measurements is of the order of 100 or more. 

There remain three matters affecting seriation which certainly war- 
rant some remark in this brief discussion. 

(A) Broad grouping of a discrete variable. One may on occasion 
choose to seriate a discrete variable into classes of range greater than the 
unit increment of the number scale. One may consider the red blood 
cell count in man, by way of illustration. Usually made per ten- 
thousandth of a cubic millimeter of blood, these may vary over a range 
of several hundred cells between the smallest and largest count. Ser- 
iation by integers would obviously give too many groups. One may 
well use ranges of 20 or more blood cells to a class to bring the number of 
classes down to a reasonable figure. 

(B) Irregular grouping. A constant range for all classes in the 
seriation scheme has very definite simplification value in computational 
procedure, but it is by no means necessary at all times. Indeed, it is 
common practice in social statistics to use irregular grouping for many 
variables. Table 4 gives a type of irregular classification commonly 
appearing in vital statistics, wherein the groupings may be expected to 
be dictated in part by homogeneity of interest within them from a dis- 
ease point of view. In the statistical study of his data the student of 
vital statistics may readily free himself of the handicaps of irregular 
grouping by making comparison between groups in terms of “rates,” to 
which some attention will later be given. 

(C) True class range. In the seriation of continuous variables, the 
range of each class appears to be the interval from the lowest to the high- 
est recorded measurement for each class. This is not the full range of 
the class on the continuous scale, however, for it would leave vacant 
gaps between the ending of one class and the beginning of the next. 
Since each recorded measurement actually refers to a range about a 
reference point, allowance must be made for the increment of measure- 
ment in establishing the true class range or class center. 

Where measurement is made to the nearest graduation on the scale, 
the true class range is the apparent range extended at each end by one- 
half of the increment of measurement. Some workers, however, prefer 
to ignore the final fractional increments in measuring; they record the 
graduation below rather than the “nearest” graduation. In stating 
age it is a common practice for all to give the age last birthday rather 
than to determine the nearest birthday. This method of recording has 
the obvious advantage of avoiding the exercise of subjective judgment as 
to which is the nearest graduation (or the effort of computing which is 
the nearest birthday) when the true value falls in the mid-region between 
graduations. On the other hand, when a graduation falls very close to 
the true value, precise measurement may be demanded to determine 
whether that graduation is really below or above the true value. When 
the terminal fractional increment is ignored in measurement, it must be 


TABLE 4 

Distribution by Age of the White Population of Minnesota as Recorded 
in the 15th U. S. Census, 1930 

                          Population 
Age in years     Native born     Foreign born 

Under 1               43,397               10 
 1- 4                184,407              211 
 5- 9                252,492            1,419 
10-14                249,364            2,013 
15-19                233,032            4,670 
20-24                203,628            8,781 
25-29                177,780           13,747 
30-34                169,946           17,985 
35-44                291,640           70,680 
45-54                176,938           89,811 
55-64                102,277           81,459 
65-74                 50,241           66,376 
75 and over           14,728           30,994 
Unknown                  809              138 

Total              2,150,679          388,294 


observed that the recorded value indicates that the true value lies 
between the recorded value and the next higher graduation on the 
scale of measurement. In such cases the true range of a class in a 
seriation scheme is the apparent range extended at the upper limit by 
one increment of measurement. In Table 4, for example, the true class 
ranges are 0-0.9, 1-4.9, 5-9.9, ..., and the class centers are 0.5, 3.0, 
7.5, .... 

Errors of approximation balance one another when measurement is 
made to the nearest graduation, whereas those errors give a negative 
bias to all measurements made to the graduation below. A desire to 
record a value as near as feasible to the true value seems to express itself 
naturally in general practice. In the absence of standardization con- 
cerning the disposition of fractional increments in practical measure- 
ment, the statistician must be alert to see that he ascribes the true class 
center and range to each frequency. 
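
The two recording conventions lead to different true class ranges and centers, as the following sketch shows. The function true_class and the labels "nearest" and "below" are hypothetical names introduced only for this illustration; the figures checked are the 6.0-inch class of Table 2 and the first three age classes of Table 4.

```python
def true_class(lowest, highest, increment, convention="nearest"):
    """Return (true_lower, true_upper, center) for a class whose apparent
    range runs from `lowest` to `highest` recorded value.

    convention="nearest": values recorded to the nearest graduation.
    convention="below":   fractional increments ignored (e.g. age last birthday).
    """
    if convention == "nearest":
        lower, upper = lowest - increment / 2, highest + increment / 2
    elif convention == "below":
        lower, upper = lowest, highest + increment
    else:
        raise ValueError("convention must be 'nearest' or 'below'")
    return lower, upper, (lower + upper) / 2

# Head breadth recorded as 6.0 in. to the nearest tenth of an inch (Table 2).
print(true_class(6.0, 6.0, 0.1))

# Age classes "Under 1", "1-4", "5-9" in years last birthday (Table 4).
for lo, hi in [(0, 0), (1, 4), (5, 9)]:
    print(true_class(lo, hi, 1, convention="below"))
```

Under the second convention every class center lies half an increment above the value obtained under the first, which is the negative bias referred to above when it is overlooked.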

SERIATION PROCEDURE 

The technique of seriation commonly employed for simple series of 
data recorded on sheets is to make in a suitable blank table a consecutive 
list of the seriation classes decided on, then enter each variate in its 
appropriate class by a short vertical tally stroke. The use of a cumula- 
tive cross stroke for every fifth tally in a class aids the final addition. 
The seriation of weight per bushel in 59 samples of wheat by this method 
is illustrated in Table 5. This tally method is likely to be quite tedious, 


TABLE 5 

Illustrating Seriation of a List of 59 Weights per Bushel of Wheat 


Apparent                                  Class frequency    Class center 
class range     Tally                            f                 x 

55.0-55.4       //                               2               55.2 
55.5-55.9       ////                             4               55.7 
56.0-56.4       ///                              3               56.2 
56.5-56.9       ///                              3               56.7 
57.0-57.4       ///// ///// ///// /             16               57.2 
57.5-57.9       ///// ////                       9               57.7 
58.0-58.4       ///// ///// /                   11               58.2 
58.5-58.9       ///// /                          6               58.7 
59.0-59.4       //                               2               59.2 
59.5-59.9       ///                              3               59.7 

Total                                           59 



however, with large series of data. It also suffers the disadvantage of 
requiring a complete duplication of the work to secure a check on the 
accuracy of the seriation. 
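
The tally procedure may be sketched as follows. The class scheme is that of Table 5 (classes of five tenths beginning at 55.0), but the twelve weights are made-up values, since the original list of 59 weights is not reproduced in the text; the names START_TENTHS, WIDTH_TENTHS, and tally are hypothetical.

```python
# Hypothetical weights per bushel (lb), recorded to the nearest 0.1 lb.
weights = [57.2, 55.6, 58.0, 57.4, 56.9, 57.0,
           58.3, 57.7, 59.5, 57.1, 55.1, 57.9]

START_TENTHS = 550   # lowest apparent class limit, 55.0, expressed in tenths
WIDTH_TENTHS = 5     # five graduations of 0.1 lb per class
N_CLASSES = 10

counts = [0] * N_CLASSES
for w in weights:
    tenths = round(w * 10)   # work in whole tenths to avoid floating-point slips
    counts[(tenths - START_TENTHS) // WIDTH_TENTHS] += 1

def tally(f):
    """Strokes in groups of five, standing in for the cross-stroke convention."""
    groups = ["/////"] * (f // 5)
    if f % 5:
        groups.append("/" * (f % 5))
    return " ".join(groups)

for k, f in enumerate(counts):
    low = (START_TENTHS + k * WIDTH_TENTHS) / 10
    high = low + (WIDTH_TENTHS - 1) / 10
    print(f"{low:.1f}-{high:.1f}  {tally(f):<12} {f}")
print("N =", sum(counts))
```

A check that the class frequencies sum to N guards against omitted entries, though it cannot detect a variate tallied in the wrong class.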

A much more satisfactory procedure is achieved by sorting cards 
having the raw data entered on them in a suitable form. This is par- 
ticularly true when two or more variables are to be correlated. In such 
cases it is well worth while to design a suitable card first and transcribe 
all the data onto such cards. The N observations on the primary vari- 
able are entered one each on a separate card, and with these are also 
entered the associated values for any other variables to be considered in 








the analysis. With several variables, cards may be specially printed 
with the variable designations if the project warrants it. The card 
reproduced in Fig. 3 is illustrative. It was prepared for a series of 
several thousand cases of obstetrical confinement, 37 variables in all 
being considered. 


FIGURE 3 

A Statistical Record Card with Printed Variable Designations, 
Prepared for an Obstetric Study 


[Facsimile of a completed record card; handwritten entries not reproduced. 
Printed designations legible on the card include: M. G. H. Case No., Series, 
MOTHER: Age, F. Menst., L. Menst., Gravida, Para., Wt. A. P., Wt. P. P., 
Intersp., Intercr., Rt. Obl., Ext. Conj., Diag. Conj., Obs. Conj., Intertub., 
Post Sag., Interval, Int. days, Duration, Pains, Severity, Labor I, II, III, 
Date Del., CHILD: Sex, Lth., Wt., N. B. Heart, Fetal Heart.] 



When only very few variables are to be studied, cards with printed 
titles are by no means necessary. A key card may be prepared to define 
the positions for entry of the data on blank cards, of the same size, as in 
the following specimens. Figure 4 shows the key card for a set of three 
chemical determinations made independently by two analysts on a series 
of flour samples. Figure 5 reproduces the record card for Sample 1895 
of the series. There can be no confusion here as to the identification of 
the numbers so long as the key card is preserved. 

It is a very simple matter to sort such cards into stacks on a flat sur- 
face according to some seriation scheme. Each stack may be checked 
separately with ease to see that its variates fall within the specified class 
range. The number of cards in each stack may then be counted and 
the frequency entered directly into its proper place in a prepared table. 
The cards may readily be preserved in seriation by cross-stacking if 
desired, being then ready for reseriation within each group for a second 
variable to be correlated with the first, and so on. 
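
The sorting of cards into stacks, followed by reseriation within each stack for a second variable, can be imitated as below. The six "cards," the variables age and weight, the class limits, and the function sort_into_stacks are all hypothetical, chosen only to show the mechanics.

```python
from collections import defaultdict

# Hypothetical cards: each carries two integer-valued variables for one case.
cards = [
    {"case": 1, "age": 23, "weight": 131},
    {"case": 2, "age": 31, "weight": 118},
    {"case": 3, "age": 27, "weight": 142},
    {"case": 4, "age": 22, "weight": 127},
    {"case": 5, "age": 34, "weight": 135},
    {"case": 6, "age": 28, "weight": 149},
]

def sort_into_stacks(cards, variable, start, width):
    """Sort cards into stacks keyed by the lower limit of each class."""
    stacks = defaultdict(list)
    for card in cards:
        lower = start + ((card[variable] - start) // width) * width
        stacks[lower].append(card)
    return dict(sorted(stacks.items()))

# First seriation: by age, in classes of 5 years beginning at 20.
age_stacks = sort_into_stacks(cards, "age", start=20, width=5)

# Reseriation within each age stack: by weight, in classes of 10 beginning at 110.
for age_lower, stack in age_stacks.items():
    weight_stacks = sort_into_stacks(stack, "weight", start=110, width=10)
    for weight_lower, sub in weight_stacks.items():
        print(age_lower, weight_lower, len(sub))
```

The counts printed for each pair of classes are the cell frequencies of a two-way table, which is the form in which two variables are ordinarily brought together for the study of their correlation.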
Tremendous strides have been made in the last decade or two in the 
development of mechanical recording and card-sorting systems. These 
depend primarily on careful planning of a standard sized card to be 

FIGURE 4 

A Specimen “Key Card” for Identification of Entries on Similar 
Cards Without Inscribed Variable Designations 
(Corresponding Record Card Below) 



FIGURE 5 

A Specimen Record Card From a Short Series Investigation 
(Data Identified by the “Key Card” of Figure 4) 

[Facsimile of the record card for Sample 1895; handwritten entries not reproduced.] 


punched with holes in positions specified to accord with a fixed plan of 
seriation for each variable. Sorting then proceeds by electrical contacts 
made through the holes as each card passes between wire brushes and a 
plate as electrodes, the position of the hole determining the destination 
of the card in the receiving mechanism. Interlined cards permitting of 
the inclusion of written records further enhance the scope of the system. 
The utility of the method in very large-scale investigations is well-nigh 
amazing, but of course every method may be expected to have its limita- 
tions. The reader interested in the method must be asked to refer for 
further information to literature prepared by the manufacturers of the 
machinery, or to one of the special books prepared on the subject. 

GRAPHICAL REPRESENTATION OF DATA 

The orderly manner of change in the class frequencies as one pro- 
gresses from the lowest to the highest class along the scale of measurement 
is reasonably well apparent from inspection in the first three of the fore- 
going illustrative tables. In Table 4, although N is very large in each 
case, definite obstruction to one’s appreciation of the orderly nature of 
the change in frequency is introduced because of the disturbing effect of 
the irregular grouping. A much better appreciation of the nature of 
the orderliness may always be secured through graphical portrayal of the 
data. Before embarking on demonstration of this, it may be well to 
consider first some points with respect to graphs in general. 

A statistical graph is a geometric drawing which aims to portray the 
nature of the relationship between two (or more) quantities which vary 
over a range of values. This portrayal of the relationship as a whole is 
effectively achieved through the replacement of number by dimension. 
A complete pattern of dimensions is readily assimilable by eye, whereas 
number comparisons are ordinarily appreciated only in pairs, and then 
as a result of processes in mental arithmetic. 

The effectiveness of any graph in presenting a desired picture depends 
predominantly on the clarity of design underlying it. Simplicity of 
plan is a prime essential to that effectiveness. Boldness of execution is 
probably next in importance. Clearness of line, scaling, and lettering, 
together with adequacy of descriptive legends, impart a quality to graphs 
for which there is no substitute. Many a graph has been severely
impaired in value through incorporating in it several ideas or so much 
detail that confusion results. Other graphs fail completely in giving 
the correct impression because of violation of the fundamental principle 
that the unit of dimension with respect to any axis must remain constant 
at all positions along that axis; otherwise dimension does not replace 
number in a simple manner. 

As drawings on plane surfaces, graphs in their simpler forms are confined
to picturing the relationship between two variables only, using for
the scales of those variables the two directions at right angles in terms of
which surfaces are ordinarily measured. These two directions are
called axes. They are conventionally established in the so-called horizontal
and vertical directions, or from left to right and from bottom to
top of the page as it lies before one on a desk. The horizontal axis is
known as the axis of abscissas, and the vertical the axis of ordinates. On
lines drawn in each of these two directions, scales may be measured off,
and any point on the surface may then be referred to each of these scales,
separately and jointly. Thus the association of two magnitudes may be 
expressed by a single point, and repeated associations of such paired 
magnitudes may be portrayed by a succession of points forming a line 
or by a dispersion of such points over the surface. The relationship 
which two quantities varying over their respective scales bear to one 
another may therefore be portrayed clearly in terms of a simple surface 
to give a unified picture of the whole. 

All graphical dimensions are constructed from numbers. A graph 
cannot be any more accurate than the data from which it is constructed. 
It is only because a general view of problems involving many numbers 
may usually be secured much more readily from a suitable graph of the 
data that the graph is constructed. We may now look specifically into 
the problem of graphically portraying frequency distributions. 

BAR DIAGRAMS AND HISTOGRAMS 

Graphical representation of distribution of frequency along the scale 
of a variable may readily be given, one axis (conventionally the abscissa) 
being used for the scale of measurement of the variable, frequency being 
portrayed in terms of the other axis. For any discrete variable which
is not seriated into broad classes, frequency is assembled at successive 
integers on the scale of whole numbers. These integers may be ascribed 
as values to corresponding successive points on the abscissal scale. At 
these points ordinates may be erected, each of length proportional to the 
frequency at that point. In the upper panel of Fig. 6 a graph so drawn 
for the data of Table 1 is reproduced. Although technically correct, its 
deficiency in boldness because of the thin frequency lines is at once appar- 
ent. Even what one sees with these fine lines is really an exaggeration, 
for technically a line has no thickness. Technicalities may, however, 
be swept aside quite properly at times. Exaggeration of the frequency 
lines into prominent solid bars would obviously do much to add to the 
practical effectiveness of the picture, without which a graph has no value. 
This is done in the lower panel of Fig. 6. Graphs of this nature are 
commonly called bar diagrams. 

In preparing to portray the frequency distribution of a continuous 
variable, one recognizes at once that a different situation exists. Fre- 
quency now is spread over a range; it is not to be represented by an 
ordinate at a point but by an ordinate moved over a range of the abscissal 
scale, that is, by a rectangular area. One may then ask: What does 
the scale of ordinates describe? The solution is really quite simple. If 
seriation be into classes of equal range, then all the rectangles designating
frequency will have equal bases and therefore the heights of the rectangles
will be proportional to the frequencies per class range of . . . units.
The ordinate scale is therefore one of frequency per . . . units on the
scale of the variable. This seemingly technical refinement is most 
important to the correct graphical portrayal of a continuous frequency 
distribution. To ignore it will cause quite erroneous graphs to be drawn 
when the class ranges are irregular. Recognition here of a technicality 

FIGURE 6

Frequency Distribution for Number of Children in Completed Families,
New Zealand, 1916 *

[Bar diagrams in an upper and a lower panel. Horizontal axis: number of children per family.]


is essential to the purpose of the graph itself. The frequency correspond- 
ing to each class range on the scale of measurement is portrayed by a 
rectangle erected vertically above that range as a base and of area
proportional to the frequency. 

The point giving the upper limit of the true range of one class in a 
continuous variable also specifies the lower limit of the next. Hence 
the graph will assume the form of a collateral array of rectangles on a 
common base line. The figure thus established is known among statisticians
as a histogram (Greek: histos, a web or tissue; gramma, a thing
written).


* For basic data, see Table 1, page 16.

In Fig. 2a the histogram is drawn for the distribution of length of
left middle finger as classified on the right-hand side of Table 3. Taking
any one class, for instance that of greatest frequency, the 691 individuals 
in it have values lying between 114.5 and 117.5 mm. This frequency 
pertains not to any one point, but to the range designated. Correct 
representation of the distribution must spread the frequency over the 

TABLE 5a

Brain Weights at Autopsy by Sex
(Data of Retzius)

Class range         Frequency
 in grams        Males   Females

  900-  949                  1
  950-  999                  -
1,000-1,049                  1
1,050-1,099                  2
1,100-1,149         1       12
1,150-1,199         4       16
1,200-1,249        13       24
1,250-1,299        18       26
1,300-1,349        35       16
1,350-1,399        53       14
1,400-1,449        43        8
1,450-1,499        40        6
1,500-1,549        21        -
1,550-1,599        19        1
1,600-1,649        11
1,650-1,699         3
1,700-1,749         1

Totals            262      127


range to which it belongs. The scale of ordinates in Fig. 2a therefore 
portrays the frequency per 3-mm. range of finger length. Since all
classes in this histogram have the same range, the heights of the rec- 
tangles are proportional throughout to the class frequencies. 

In a study of brain weight at autopsy for individuals of both sexes
between 20 and 50 years of age, Retzius ¹ has given the series of data
reproduced as Table 5a herein. The reader may undertake to prepare

¹ A. Retzius. Ueber das Hirngewicht der Schweden. Biol. Untersuchungen,
N.F. Bd. 9, Cap. 4: 51-68. 1900.

histograms of the two distributions in an arrangement permitting of
ready comparison. In this connection the following notes may be 
observed: 

(1) Conversion of the frequencies within each sex to a percentage 
basis will eliminate the effect of unequal size of the two samples. 
The total actual frequency should be indicated against each histo- 
gram in all such transfers to a relative frequency scale. 

(2) Confusion of the two histograms will result if one is super- 
imposed on the other. This may be avoided suitably by using 
two panels, one directly below the other in relation to a common 
weight scale. 

(3) Superimposed frequency diagrams of an approximate nature 
may be made by plotting each percentage frequency as a point above 
the appropriate class center value. When these points are joined in 
order an outline somewhat simulating a frequency curve is obtained. 
Such an outline is commonly known as a “frequency polygon.” 
The two polygons may be distinguished easily by using full lines 
for one and broken lines for the other. The reader should determine 
for himself the reason why the frequency polygon is designated as an 
approximate representation of the frequency distribution. 
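
Notes (1) and (3) can be tried out numerically. The following Python sketch converts the two frequency columns of the Retzius brain-weight table to percentages and lists the points of each frequency polygon. It is only an illustration: the class centers are an assumption (each is taken as the midpoint of the stated 50-gram class), blank or dashed entries are read as zero, and the function names are hypothetical.

```python
# Frequencies by 50-gram class, transcribed from the brain-weight table
# (Retzius); blank or dashed entries are entered as 0.
lower_limits = list(range(900, 1750, 50))          # 900, 950, ..., 1700
males = [0, 0, 0, 0, 1, 4, 13, 18, 35, 53, 43, 40, 21, 19, 11, 3, 1]
females = [1, 0, 1, 2, 12, 16, 24, 26, 16, 14, 8, 6, 0, 1, 0, 0, 0]


def to_percent(frequencies):
    """Express each class frequency as a percentage of the sample total."""
    total = sum(frequencies)
    return [100.0 * f / total for f in frequencies]


def polygon_points(lower, percentages, class_range=50):
    """Pair each percentage with its class center (assumed midpoint)."""
    return [(lo + class_range / 2, p) for lo, p in zip(lower, percentages)]


male_pct = to_percent(males)
female_pct = to_percent(females)
male_polygon = polygon_points(lower_limits, male_pct)
female_polygon = polygon_points(lower_limits, female_pct)

# The actual totals are reported alongside the relative frequencies, as note (1) asks.
print(f"N males = {sum(males)}, N females = {sum(females)}")
for (center, m), (_, f) in zip(male_polygon, female_polygon):
    print(f"{center:7.1f} g   males {m:5.1f}%   females {f:5.1f}%")
```

Joining the listed points of either polygon in order, with full lines for one sex and broken lines for the other, gives the superimposed diagram described in note (3).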


TABLE 6

Frequency Distribution for Age at Marriage of Bachelors and Spinsters
in England and Wales for the Year 1932

  (1)         (2)         (3)              (4)             (5)
Age group  Frequency  Class range   Average frequency   True age
                      (in years)     per year of age     range

 15-20       13,372        6              2,229          15-21
 21-24       93,728        4             23,432          21-25
 25-29      115,113        5             23,023          25-30
 30-39       45,656       10              4,566          30-40
 40-49        6,337       10                634          40-50
 50-59        1,572       10                157          50-60
 60-89          438       30                 15          60-90


Sometimes it is inconvenient, or it is not possible with the data given, 
to preserve uniformity of class range in a histogram. Take, for instance, 
the data on number of marriages in England and Wales, 1932 (first 
marriages for both parties), given in the first two columns of Table 6.
Preparation of a histogram to portray the distribution of age at marriage
as correctly as these limited data permit would necessitate first some
such calculation as is given in columns (3) and (4) of the table. The
histogram may then be drawn readily in terms of the figures in the last
two columns of this table. It is presented as Fig. 7. The reader may
test his grasp of this point by preparing histograms of the age distributions
for two elements of the white population of Minnesota as recorded
in Table 4. For this purpose the greatest age may be considered as 105, 
and the group of unknown age may be ignored. 
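
The calculation of columns (3) and (4) can be sketched in a few lines of Python. The block below is an illustration only: it takes the frequencies and true age ranges of Table 6, divides each frequency by its class range to get the height of the rectangle, and the function name is hypothetical.

```python
# (lower limit, upper limit of true age range, frequency), from Table 6.
classes = [
    (15, 21, 13372),
    (21, 25, 93728),
    (25, 30, 115113),
    (30, 40, 45656),
    (40, 50, 6337),
    (50, 60, 1572),
    (60, 90, 438),
]


def rectangle_heights(table):
    """Height of each rectangle: frequency divided by class range in years."""
    return [freq / (upper - lower) for lower, upper, freq in table]


heights = rectangle_heights(classes)
for (lower, upper, freq), h in zip(classes, heights):
    print(f"{lower:2d}-{upper:2d}  range {upper - lower:2d} yr  "
          f"frequency {freq:7d}  per year {h:9.1f}")
```

Because height times base returns the class frequency, the area of each rectangle, and not its height, stands for the frequency when the class ranges are unequal.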

FIGURE 7

Frequency Distribution for Age at Marriage of Bachelors and Spinsters
in England and Wales, 1932 *

[Histogram. Horizontal axis: age in years, marked from 15 to 85.]


THE FREQUENCY CURVE 

The advantage of coarsening the grouping beyond the unit of 
measurement in order to portray graphically the form of a frequency dis- 
tribution is quite pronounced whenever the total frequency is not many 
times larger than the number of increments of measurement between 
the smallest and greatest variate. Figure 8 illustrates the effect of 
coarsening the grouping progressively in smoothing out the histogram 
for a series of 402 new-born infant weights, measured originally to 
half ounces. This quite apparent smoothing as the class range is 
widened is due to the relatively smaller effect of errors of sampling 
on class frequencies as the latter are increased. 

Let it now be assumed that it is possible to increase the number in 
the sample indefinitely, and that accordingly the unit of grouping of the 
material may be reduced to a much smaller range without loss of the 
* For basic data, see Table 6, page 29.

general form of the distribution. Each rectangle in panel D of Fig. 8
would then be split into a succession of much finer rectangles, not all of
the same height or frequency but changing in a systematic manner from
one to the other as do the coarse groups in the sample of limited size.
This is illustrated in Fig. 9, where development of the concept is given 

FIGURE 8

Frequency Distributions for Weight at Birth of 402 Female Infants *

[Histograms in successive panels, panel D carrying the fitted curve. Horizontal axis: weight in pounds, marked from 5 to 11.]


both logically and illogically in terms of the single broad class, 6.0 to 6.5 
pounds. With an infinite number of such infant weights, and an incre- 
ment of measurement of infinitesimal range, a frequency distribution 
bounded apically by a smooth curve must result. The type given by 
the flowing line in panel D of Fig. 8 would probably be obtained. This 
frequency curve has been fitted in accordance with critical statistical
procedure, and does not represent free-hand graduation or reflect the

* For basic data, see Table 9, page 57.

influence of biological judgment arising from wider experience with this
particular variable. As far as the given series of data alone may form a
satisfactory basis for generalization, the curve depicts a statistical
estimate of the form of distribution in the supply from which the sample 
has been drawn. It will be noted that this series of infant weights sug- 
gests a distinctly non-symmetrical distribution for this variable. 


FIGURE 9

Illogical and Logical Sub-Division of Frequency in Passing from
Coarse to Finer Grouping in a System of Very Large Frequency

(Illustrative Segments from Panel (D) of Figure 8)

[Two panels, illogical and logical sub-division. Horizontal axis of each: weight in pounds.]


SYMMETRY 

Frequency systems may primarily be divided into two main groups, 
the symmetrical and the skew. Figures 1 and 2 belong to the former 
group wherein positive and negative deviations from the central value 
have equal frequency. This balance of frequencies about the central 
value is not realized in the distribution curve of Fig. 8, which is desig- 
nated as skew. One might say that in skew distributions one “tail” of 
the curve is more “drawn out” than the other. Skewness may occur in 
either direction; that is, it may be either positive or negative. These 
terms for direction of skewness are related to the direction in which the 
“long tail” extends. When that tail is toward larger values than the 
average it is called positive, and vice versa. 

Skewness may be, although it rarely is in biology, of such magnitude
as to change the form of the distribution from the I-shaped curves
already presented, to forms designated for convenience as J- and skew
U-shaped. The I-shaped curves are those in which the peak of frequency
occurs between the limits of the distribution, whereas in the J forms a
single peak occurs at one end of the distribution, and in the U forms a
peak occurs at each end of the distribution. There is, of course, a
complete transition scheme for I-shaped curves from extreme skewness
in one direction, through symmetry, to extreme skewness in the opposite
direction. The J and skew U curves represent still more extreme
departure from the central symmetrical I form. The sketches of
Fig. 10 will serve to indicate the transition scheme.


FIGURE 10

Illustrating the Transition Scheme in Skewness
of Frequency Distributions *

[Sketches of the transition scheme; one surviving label reads: I curve, positive skewness.]




KURTOSIS 

Symmetrical curves vary with respect to the relative concentration 
of the frequency at the center of the distribution. Small differences 
between curves in this regard are not so readily detectable as are those 
of skewness, although the characteristic is obvious when extreme forms 
are compared. This characteristic may be described as degree of peakedness.
It is known technically as kurtosis, from the degree of curvature

* All curves drawn are from graduations of real variables.

at the peak. Positive (or lepto-) kurtosis, mesokurtosis (that of the
“law of error” or normal curve), and negative (or platy-) kurtosis mean
simply that the clustering at the center is respectively greater than, 
equal to, or less than that of the normal curve. The U-shaped sym- 
metrical curve represents an extreme form of negative kurtosis, the 
rectangular distribution being somewhat less so, but this is too technical 
a subject for detailed consideration at this point. The sketches of 
Fig. 11 illustrate a wide range of kurtosis. 


FIGURE 11 

Illustrating the Transition Scheme for Kurtosis 
of Symmetrical Frequency Distributions



BIOLOGICAL VARIATION 

Although variation may express itself in any one of a continuous 
array of forms departing from the central type in all degrees of skewness 
or kurtosis, biological variables for the most part do not differ very
markedly from the “normal curve” or “law of error” form. While
relatively small degrees of skewness and kurtosis are very common, large
deviations from normality are the exception rather than the rule. This
has a most important bearing in simplifying statistical procedures for
the biologist.

Quantitative description of the characteristic features of frequency 
distributions, which may be extended to any number of dimensions of 
space, comprises a basic task in statistical analysis. Readily compre- 
hensible numerical quantities must be devised which will describe these 
features accurately, so that comparisons of them in different variables 
may be accomplished with ease and in an objective manner. Attention 
will now be given to the simpler of these problems. 



CHAPTER 3 

TYPICAL VALUES 

Suppose a series of magnitudes, $x_1, x_2, x_3, \ldots, x_N$, all recognized as
belonging to the same variable, to have been recorded in quantitative
description. If all these individual values are identical, it is not neces- 
sary to determine mathematically an expression to represent them. The 
value required would be obvious by inspection. Any one value of x 
would be entirely representative of the whole series. 

Complete concordance of all values of a variable is rarely, if ever, 
encountered in biological work. A situation of perfect concordance 
would (or should) immediately incite suspicion that the observations 
are biased. Indeed, in all fields of science, as the instruments for 
measurement become more refined it is increasingly apparent that 
things which may at first have seemed identical really differ among 
themselves. Under such conditions no single individual measurement 
can be accepted as being wholly representative of the entire sample. It 
becomes necessary to obtain a description of the series as a whole that 
will take into account all the measures available. A first step is to 
establish a typical value for the variate magnitudes — a single value 
which will be representative of the series of magnitudes in some specified
and appropriate manner.

THE ARITHMETIC MEAN 

The average or mean value is so familiar as a typical magnitude that
it seems superfluous to discuss its functions; and yet, like much that is
commonplace, its philosophical implications rarely receive consideration.
It provides a measure of type, the usefulness and appeal of which are
attested by the extensive service which it has rendered since calculation
first began. It is simple both to comprehend and to compute. It is
perfectly impartial and unselective, for every observation enters into its
magnitude without any trace of distortion. All who have employed it
(and who, in fact, has not?) have in truth ascended to the first rung of
the ladder of statistical analysis.

A very useful symbol for the mean value of a variable is provided
by placing a bar above the letter denoting the variable. If $x$ is that
variable, then the symbol for the mean, $\bar{x}$ (customarily verbalized as
“x bar”), finds some preference over such alternatives as $M_x$ (mean of $x$)
or $A_x$ (average of $x$) because of its avoidance of the subscript notation.
In algebraic definition of $\bar{x}$, one may write,

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N} = \frac{\Sigma(x)}{N} \tag{1}$$

The Greek letter $\Sigma$ (pronounced “sigma”), which is the equivalent of the
English capital letter S, denotes summation for the entire series of values
indicated by the symbols that immediately follow it. It is merely a
mathematical shorthand symbol for continued addition of all values
corresponding to the one given. When the equation is rearranged to the
form

$$\bar{x} = \frac{x_1}{N} + \frac{x_2}{N} + \frac{x_3}{N} + \cdots + \frac{x_N}{N}$$

the mean appears as a sum compiled from one $N$th part of every magnitude
considered. Each individual in the series is therefore represented
in the mean in proportion to its magnitude. This property defines the
sense in which to the lay mind the mean logically typifies any series for
which it is calculated.

Technically, equation (1) defines the arithmetic mean of a series.
Two other types of average generated by the arithmetic mean on transformed
scales and therefore appropriate to those special conditions justifying
the transformation are the geometric and harmonic means, defined
respectively by $G$ and $H$ in the equations:

$$\log G = \frac{\Sigma(\log x)}{N}, \quad \text{or} \quad G = \sqrt[N]{x_1 x_2 x_3 \cdots x_N}$$

$$\frac{1}{H} = \frac{\Sigma(1/x)}{N} = \frac{1}{N}\left(\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_N}\right)$$
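
The three definitions can be compared on a small series. The Python sketch below is illustrative only: the three values are hypothetical, not data from the text, and the function names are assumptions. The geometric mean is computed as the arithmetic mean on the logarithmic scale and the harmonic mean as the arithmetic mean on the reciprocal scale, each transformed back.

```python
import math


def arithmetic_mean(xs):
    """Sum of the values divided by their number N."""
    return sum(xs) / len(xs)


def geometric_mean(xs):
    """Antilog of the arithmetic mean of the logarithms (positive values only)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def harmonic_mean(xs):
    """Reciprocal of the arithmetic mean of the reciprocals (non-zero values only)."""
    return len(xs) / sum(1.0 / x for x in xs)


values = [2.0, 4.0, 8.0]   # hypothetical illustrative series
print(arithmetic_mean(values), geometric_mean(values), harmonic_mean(values))
```

For this series the arithmetic mean is 14/3, the geometric mean is the cube root of 64, and the harmonic mean is 3 divided by 0.875, so the three averages differ although all describe the same values.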


Since these statistics pertain to special types of analysis of relatively 
rare occurrence in biometric work, further discussion of them will not 
be given here. When the term mean is used without further designation,
the arithmetic mean is always implied.

A fundamentally different approach to the problem of establishing
a representative value for a series of variates may be made through selec- 
tion of some particular variate or group of variates as typical of the 
whole. Two such selections came into rather common usage in the days 
when calculating machines were very much more of a luxury than they 
are now. These two descriptive values became more or less obvious as a 
result of organization of the data and do not depend primarily on cal- 
culation. They merit careful consideration at the present time, not as 
short-cut approximations to the mean for which purpose they appear to 
have been widely used, but as quantities descriptive of type in different 
senses from the mean. One may perhaps more readily grasp the 
descriptive value of these quantities through derivation of them directly 
from actual data. 

THE MEDIAN

In a study of the erythrocyte count (red blood cell count) for normal
men, Haden ¹ gives findings for 40 individuals. Arranged in order of
magnitude, these counts were as follows:

4.27, 4.32, 4.40, 4.52, 4.56, 4.58, 4.64, 4.70, 4.72, 4.73,
4.80, 4.80, 4.80, 4.80, 4.84, 4.87, 4.89, 4.93, 4.97, 4.98,
4.99, 5.00, 5.02, 5.05, 5.09, 5.09, 5.10, 5.15, 5.16, 5.20,
5.20, 5.20, 5.26, 5.28, 5.36, 5.46, 5.49, 5.50, 5.57, 5.62,

millions per cubic millimeter.²

This arrangement of a series of values in order of magnitude is known as
ranking. It enables one to determine quickly how many individuals
fall below and above any given magnitude. The problem is to select a
magnitude which is representative of the set. One may endeavor to
do this with the idea of a central value in mind. Since the above series
has an even number of variates, there is no “middlemost” individual.
However, 20 individuals have a count not exceeding 4.98, and 20 have a
count not less than 4.99. The mid-point between these two magnitudes
may logically be selected as the middlemost magnitude. It is known
technically as the median value. Had N been odd, then there would
have been a middlemost individual, giving a unique solution to the problem
of ascertaining the value above and below which an equal number of
observations occur. Of course the situation would become complicated

¹ R. L. Haden. Accurate criteria for differentiating anemias. Arch. Int. Med.,
31: 766. 1923.

² Note that the count is made under a microscope in a volume equivalent usually
to only one ten-thousandth of a cubic millimeter of actual blood. The result is then
multiplied by 10,000 and the count reported as per cubic millimeter, hence the
jumps by a smallest increment of 10,000 in the figures recorded above.

again if several variates had that same magnitude, but we need not pause
to argue this point. The median may be accepted as being representa- 
tive in the sense that it divides the total frequency into halves; it is 
central with respect to the distribution of the total frequency along the 
scale. 
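
The ranking argument can be followed step by step in code. The sketch below is illustrative only: it applies the rule just described (middlemost variate when N is odd, mid-point of the two central variates when N is even) to the 40 erythrocyte counts, and compares the result with the `median` function of Python's standard `statistics` module. The function name `median_by_ranking` is hypothetical.

```python
import statistics

# The 40 erythrocyte counts (millions per cubic millimeter) quoted from Haden.
counts = [
    4.27, 4.32, 4.40, 4.52, 4.56, 4.58, 4.64, 4.70, 4.72, 4.73,
    4.80, 4.80, 4.80, 4.80, 4.84, 4.87, 4.89, 4.93, 4.97, 4.98,
    4.99, 5.00, 5.02, 5.05, 5.09, 5.09, 5.10, 5.15, 5.16, 5.20,
    5.20, 5.20, 5.26, 5.28, 5.36, 5.46, 5.49, 5.50, 5.57, 5.62,
]


def median_by_ranking(values):
    """Rank the values, then take the middlemost one (N odd) or the
    mid-point of the two central ones (N even)."""
    ranked = sorted(values)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2


m = median_by_ranking(counts)
below = sum(1 for c in counts if c < m)
above = sum(1 for c in counts if c > m)
print(f"N = {len(counts)}, median = {m:.3f}, below = {below}, above = {above}")
print("statistics.median agrees:", abs(m - statistics.median(counts)) < 1e-12)
```

The value obtained is the mid-point of 4.98 and 4.99, that is 4.985, with 20 counts on either side.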

As the central value with respect to frequency distribution, the 
median has the peculiar property of being unchanged by alteration of 
any or all of the other variate magnitudes in the series, provided that
the sign of deviation from the median remains unchanged in each case. 
Stated inversely, the numerical magnitudes of the individual deviations 
from the median in a sample do not influence the median for that sample. 
Thus the median is often quite useful as a measure of “central tendency”
in a sample where it is desired to suppress the influence of the extreme
or unusual variate magnitudes. It is a magnitude representative of
mediocrity, serving to eliminate the influence of the geniuses and the
morons for the student of educational statistics, or the influence of 
“epidemics” and periods of unusual good health for the vital statistician, 
and so forth. 
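
This property is easy to verify on a short series. In the sketch below, which is illustrative only, eight of the erythrocyte counts are taken as a small sample and the two largest are then replaced by hypothetical, much larger values. Every variate keeps the sign of its deviation from the median, so the median is unmoved while the mean shifts.

```python
import statistics

# Eight of the erythrocyte counts, used here as a small illustrative sample.
original = [4.27, 4.58, 4.80, 4.98, 4.99, 5.20, 5.49, 5.62]

# Hypothetical alteration: the two largest values are exaggerated. Each value
# stays on the same side of the median as before.
altered = [4.27, 4.58, 4.80, 4.98, 4.99, 5.20, 7.00, 9.00]

median_original = statistics.median(original)
median_altered = statistics.median(altered)
mean_original = statistics.mean(original)
mean_altered = statistics.mean(altered)

print(f"median: {median_original:.3f} -> {median_altered:.3f}")
print(f"mean:   {mean_original:.3f} -> {mean_altered:.3f}")
```

The same comparison with the smallest values pushed still lower would leave the median unchanged in the same way.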


THE MODAL VALUE 

The organization of a series of data by ranking the variates places 
identical magnitudes together and automatically draws attention to the 
relative frequencies of occurrence of the observed values. It is a logical 
stepping stone to the form of organization which we have already con- 
sidered under the title of seriation. Just as ranking naturally centers 
attention on the number of observations above and below any specified 
point, so seriation emphasizes the importance of successive values with 
respect to frequency of occurrence. The class of greatest frequency 
naturally draws attention on this account, if the class ranges are uniform.
It is called the modal or “most fashionable” class, the class center
being commonly designated as the modal value. This provides immediately
a value which is typical of the observations in a new and quite
appropriate sense. 

It is well to recognize at once that establishment of the modal value 
is fundamentally associated with a reasonably satisfactory definition of 
this point in the appropriate law of frequency distribution for the variable 
considered. If any seriation does not lead to a reasonably smooth fre- 
quency distribution, the question arises at once of the acceptability of 
the indicated modal value. Now “smoothness” of a histogram is a func- 
tion of size of sample and the unit of classification. This unit is chosen 
quite arbitrarily, and the indicated modal value may be shifted substantially
according to this selection, even when N is quite large.

The difficulties of selecting the appropriate modal value by seriation
of a small sample are often overwhelming. Let us look for a moment at 
the series of 40 erythrocyte counts above. The count of 4.80 occurs 
four times and is the most frequent individual value as the data are 
given. But what would one do about selecting the modal value if a 
recheck of the counts showed an error, one of the 4.80 values being 
changed, say, to 4.81? In that case both of the counts 4.80 and 5.20 
would occur with a frequency of 3, giving two modal values rather 
widely separated. Coarser grouping may be resorted to, but the 
indicated modal value will be found to vary around 5.00 according to the 
seriation scheme adopted. Study of the histograms for the same series 
of observations in the successive panels of Fig. 8* will quickly substantiate
the general principle. As applied to a small sample, the modal value 
has rather poor descriptive capacity. From what has been said in the 
preceding chapter it is easy to see that this deficiency will diminish as 
the size of the sample is increased and its frequency distribution accord- 
ingly reflects more clearly the supply form. 
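
The instability described above can be reproduced directly. The following sketch is illustrative only: it tallies the 40 erythrocyte counts, reports the most frequent value or values, and then repeats the tally after one of the 4.80 readings is changed to 4.81, the hypothetical recheck discussed in the text. The function name is an assumption.

```python
from collections import Counter

# The 40 erythrocyte counts (millions per cubic millimeter) quoted from Haden.
counts = [
    4.27, 4.32, 4.40, 4.52, 4.56, 4.58, 4.64, 4.70, 4.72, 4.73,
    4.80, 4.80, 4.80, 4.80, 4.84, 4.87, 4.89, 4.93, 4.97, 4.98,
    4.99, 5.00, 5.02, 5.05, 5.09, 5.09, 5.10, 5.15, 5.16, 5.20,
    5.20, 5.20, 5.26, 5.28, 5.36, 5.46, 5.49, 5.50, 5.57, 5.62,
]


def modal_values(values):
    """Return the most frequent value(s) and the frequency they share."""
    tally = Counter(values)
    top = max(tally.values())
    return sorted(v for v, c in tally.items() if c == top), top


# Hypothetical recheck: one of the four 4.80 readings becomes 4.81.
rechecked = list(counts)
rechecked[rechecked.index(4.80)] = 4.81

print("as given: ", modal_values(counts))
print("rechecked:", modal_values(rechecked))
```

A change of one unit in the last recorded digit of a single variate is enough to replace one modal value by two that lie well apart.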

DESIRABLE QUALITIES OF A DESCRIPTIVE STATISTIC 

These preliminary definitions of the mean, median, and modal values 
as independent and differing descriptions of typical value raise the 
question of the relative quality of alternative descriptions. It has been
stated that quantitative description of the characteristic features of sets 
of data comprises a basic task in statistical analysis. Numerical quan- 
tities must be devised which will describe these features accurately, so 
that comparisons of them in different variables may be readily accom- 
plished in an objective manner. The term statistic is now commonly 
being applied to any summarizing quantity such as a mean, median,
et cetera, derived from and descriptive of a set of variates. Using the
term in this sense we may now well inquire as to the desirable qualities of
good descriptive statistics in general. The following list is suggested:

(1) They should be based upon every variate in the sample. 
Assuming the basic data to be accurate, surely each measurement 
is just as important as any other measurement in defining the variable.

(2) They should be capable of algebraic definition. Algebra 
may be considered as providing the vehicle of symbolic definition 
for the development of logical thought sequences applicable to mag- 
nitudes differing by finite amounts. The utility of any statistic as a 
factor in such logical development of thought will depend on the 
generality of its algebraic definition. 

* Vide page 31.

(3) They should be readily comprehensible with respect to each
descriptive function.

(4) They should be susceptible as far as possible of rapid deter- 
mination. 

(5) They should yield highly consistent values when applied to 
samples randomly drawn from the same supply. 

While the mean fulfills every one of these conditions remarkably well, 
the same cannot be said of the modal and median values with equal 
validity. The latter are not capable of algebraic definition in general 
terms, nor do they depend directly for their values on the magnitudes of 
every variate in the series. Also, the consistency of means in random 
samples from the same supply is found to be considerably greater than 
that of median or modal values. If these three statistics are to be con- 
sidered as alternatives in description, there can be no question as to the 
superiority of the mean. It should not be overlooked, however, that 
each has arisen as the result of a different point of view in approaching 
the problem of establishing a representative value. The problems in 
which they assume quite different magnitudes must be regarded as prob- 
lems in which they are not alternative, each having its own descriptive 
function. 
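
The claim about consistency in random samples, condition (5), can be examined by simulation. The sketch below is an illustration under stated assumptions, not a result from the text: the supply is assumed to be a normal distribution with mean 0 and standard deviation 1, samples of 25 are drawn 2,000 times with a fixed seed, and the spread of the sample means is compared with the spread of the sample medians. All names and settings are hypothetical.

```python
import random
import statistics


def spread_of_statistic(stat, n_samples=2000, sample_size=25, seed=1):
    """Standard deviation of a statistic over repeated random samples
    drawn from an assumed normal supply (mean 0, standard deviation 1)."""
    rng = random.Random(seed)          # fixed seed: same samples for every statistic
    results = []
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        results.append(stat(sample))
    return statistics.pstdev(results)


sd_of_means = spread_of_statistic(statistics.mean)
sd_of_medians = spread_of_statistic(statistics.median)

print(f"spread of sample means:   {sd_of_means:.3f}")
print(f"spread of sample medians: {sd_of_medians:.3f}")
```

Under this assumed supply the medians scatter more widely than the means from the very same samples. A markedly non-normal supply could change the comparison, so the sketch illustrates the statement rather than proving it in general.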


“CENTERING VALUES” OF FREQUENCY CURVES

The preceding syntheses of the mean, median, and modal values as 
representations of typical magnitude referred to the law of frequency 
distribution only in the case of the modal value. The descriptive quality
of the modal value was found to be very poor with small samples, but
increased as refinement of seriation could proceed without loss of the 
form of the frequency distribution, that is, as the sample increased in 
size. It is only in the limit of a frequency curve being reached that the 
point of greatest frequency becomes independent of the various seria- 
tions to which the data may be subjected. It was urged by Karl 
Pearson that the term mode be reserved for the scale value at which the 
greatest ordinate occurs for the fitted frequency curve. Our previous 
use of the variant term modal class for the range within which maximum 
frequency occurs in a seriation was made with this distinction in mind. 

When a histogram is graduated by a frequency curve, no attempt is 
made to force the mode of the curve to agree with the modal class of the 
histogram. The constants of the equation used fix the point at which 
the maximum ordinate will occur. These constants are determined from 
general properties of the observed data and are essentially independent 
of any seriation made. The curve given in panel (D) of Fig. 8* must
logically graduate each of the seriations given equally well. The vari- 
ous modal values of the several seriations contrast sharply with the fixed 
mode of the curve at 7.00 pounds. 

The mean and median of a frequency curve are established precisely 
in accordance with the definitions already drawn for the variates. The 
median becomes the point on the scale of measurement at which the 
ordinate cuts the frequency curve into two segments of equal area, the 
latter, as we have sought to make clear, defining the frequency. It
will be shown in Chapter 5 that the algebraic definition of the mean 
(equation 1) fixes it as the “point of balance” of a frequency system. 
That is, if a frequency distribution bounded apically by a curve is cut 
out of cardboard of uniform thickness and is balanced horizontally about 
an ordinate, then that ordinate will emanate from the base line at the 
mean value. It follows directly then that positive and negative devia- 
tions of the variates about any point in the scale of measurement of a 
variable balance one another in toto only when that point is taken at the 
mean of the distribution. This defines another attractive quality of the 
mean as a representative value. 

For symmetrical frequency curves, the median, the mode, and the 
mean all coincide, as must surely be obvious. In a skew curve this is 
not so. The point of balance (the mean) must necessarily be drawn 
increasingly away from the point of greatest frequency (the mode) 
toward the long “tail” of the distribution as skewness intensifies. The 
amount of separation between them accordingly forms an excellent 
basis for quantitative description of the degree of skewness present in a 
frequency distribution. 

The median of non-symmetrical I -shaped curves falls between the
mode and the mean, somewhat closer to the latter than the former. 
For small degrees of skewness it holds very closely to the approximating 
equation 

\[ \text{mode} - \text{mean} \approx 3\,(\text{median} - \text{mean}). \]

Solving this equation to provide quickly an estimate of the mode after 
mean and median have been determined from a set of observations is 
perhaps acceptable as a rough approximation only if the sample is a large 
one and skewness is not very marked. Rather extensive calculation is 
involved in accurate determination of the mode, for the equation to the 
graduating curve must be fully determined first. Because of this there 
is temptation to use the above approximation without proper regard to 
its limitations, a temptation to be avoided, of course. 
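
Solved for the mode, the approximating equation gives mode ≈ 3(median) − 2(mean). The following is a minimal sketch of that rearrangement in Python; the function name and the numerical values of mean and median are illustrative assumptions, not figures from the text, and the result carries all the limitations just described.

```python
def approximate_mode(mean: float, median: float) -> float:
    """Rough mode estimate from  mode - mean ~ 3 (median - mean).

    Only a rough approximation, suitable for large samples
    with slight skewness.
    """
    return 3.0 * median - 2.0 * mean


# Hypothetical summary values for a mildly skewed series (illustrative only).
mean, median = 7.30, 7.20
mode_estimate = approximate_mode(mean, median)
print(round(mode_estimate, 2))
```

Here the median lies between the estimated mode and the mean, as the text requires of a moderately skew curve.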

* Vide page 31. 
In that which follows in this book we shall assume, as indeed we must
to progress satisfactorily, that simple computations present no obstacle. 
It will be shown in the following section that a powerful device for the 
simplification of calculations may be mastered easily, placing the bulk of 
statistical work within the scope of elementary mental arithmetic. In 
view of this and the reasonable approach to symmetry commonly found 
in biological variables, the natural advantages of the mean as a typical 
value and reference point for the measurement of variation will cause its 
adoption in preference to the median for practically all work. 

CODING AS AN AID IN COMPUTATION 

The mean must be determined by calculation. Algebraically, this 
seems a most simple procedure, and indeed it is, provided 

(1) that the number of variates to be summed is not so large as to
cause fatigue when handled individually, and

(2) that the variate magnitudes are numbers of very few digits.

If either of these difficulties is present, it may readily be circum-
vented. The first may be avoided by seriation to form a frequency
table. The second is very simply overcome by coding, which consists of
replacing the true variate magnitudes by a simpler set of numbers 
according to a systematic scheme. The original scale of measurement 
may be replaced by an arbitrarily chosen scale of the simplest possible 
form in accordance with a well-known mathematical principle. All 
computations may be made in terms of the arbitrary scale and the 
results quickly transferred to their corresponding original scale values. 

The objectives of seriation have already received attention in these 
pages. It may not be amiss at this point, however, to consider certain 
procedural details as they relate to the establishment of the computa-
tional table. Solving the following practical problem will serve as a 
medium for development of these points, as well as others incidental to 
the subject of coding. 

In routine analysis of the protein content of grain arriving at a port, 
1,937 determinations are made available. The protein content of each 
sample is expressed on a percentage basis to the nearest 0.05 per cent. 
The lowest value is 10.30 per cent, and the largest is 17.15 per cent, prac- 
tically all intermediate values by increments of 0.05 occurring. How 
may the computation of the mean for such a series of values be most 
advantageously simplified? 

The very large number of variates to be considered makes seriation a 
practical necessity. Since the increment of measurement is 0.05, there 
will be 138 possible values between the extreme limits of 10.30 and 17.15. 
Grouping into classes of sufficiently wide but uniform range to reduce 
the number of classes to about 12-16 would be desirable. A class range
of 0.5 per cent protein would give approximately 14 classes. That range 
is so convenient for seriation purposes that it may well be chosen. 
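
The counts quoted above may be checked with a short sketch. It uses Python's decimal arithmetic so that the 0.05 increments are handled exactly; the variable names are illustrative assumptions.

```python
from decimal import Decimal

lowest, highest = Decimal("10.30"), Decimal("17.15")
increment = Decimal("0.05")    # increment of measurement
class_range = Decimal("0.5")   # proposed true class range

# Number of distinct scale values from the lowest to the highest, inclusive.
possible_values = int((highest - lowest) / increment) + 1

# Total spread divided by the class range: roughly the number of classes.
approx_classes = (highest - lowest) / class_range

print(possible_values, approx_classes)
```

The spread of 6.85 per cent covers 13.7 class ranges, which is the "approximately 14 classes" of the text; the exact number depends on where the first boundary is set.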

“At what value should the beginning class boundary be set?” The
reply to this question must be that, like selection of the class range, this 
matter remains entirely at the discretion of the individual. Since the 
class centers become the reference values after the seriation is made, 
some workers would aim to choose the boundaries so that the centers 
become simple numbers, like the even and half percentages. This 
selection would place the class boundaries at values ending in .25 and
.75, and the question would arise as to what to do with the large numbers
of variates which fall precisely on these boundaries. Having the true
class range an odd multiple of the increment of measurement avoids this
difficulty, but, having due regard to ease of seriation, that would be
inconvenient here, so we shall proceed with the suggested class range of 10
increments of measurement. 

Having as simple a pattern of numbers as possible for the apparent 
class boundaries greatly facilitates seriation. Thus apparent class 
ranges of 10.0 - 10.45, 10.50 - 10.95, 11.0 - 11.45, et cetera, will lead 
to quicker seriation than possibly any other scheme with a true class 
range of 0.5 unit. We shall see later that it is quite immaterial to the 
calculations what the complexity of the class centers may happen to be, 
so we shall seriate according to this plan. Table 7 presents in its first 
two columns the frequency distribution found for the 1,937 variates in 
the series. 

Since the measurements were recorded to the “nearest” scale gradua- 
tion, the mid-point of the apparent range for each class is identical with 
the center of the true class range. These centers are entered as column 
(3) of the table. For computational purposes they will replace the 
ranges, the effect of the seriation being to represent the data as if they 
had been recorded by increments of 0.5 per cent instead of 0.05. 

Proceeding now to compute the mean value for this distribution, one 
must first multiply each class center, x, by the number of individuals in
the class, f, to secure the sum of all variate magnitudes within the class.
Column (4) contains these products, the grand total of the column being
Σx for the entire distribution. By dividing this total by 1,937, the mean
protein content is found to be 13.233 per cent.
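
The computation just described may be sketched as follows, using the class limits and frequencies of Table 7. The variable names are illustrative assumptions; the class centers are taken as the mid-points of the apparent class ranges, as explained above.

```python
from decimal import Decimal

# Frequencies of the 15 classes of Table 7, lowest class first.
freqs = [4, 18, 65, 157, 270, 327, 334, 276, 212, 141, 79, 31, 14, 7, 2]

# Apparent class limits: 10.0-10.45, 10.5-10.95, ..., 17.0-17.45.
lower = [Decimal("10.0") + Decimal("0.5") * i for i in range(len(freqs))]
upper = [lo + Decimal("0.45") for lo in lower]

# Class center = mid-point of the apparent range.
centers = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]

N = sum(freqs)
class_totals = [f * x for f, x in zip(freqs, centers)]   # column (4), fx
sum_x = sum(class_totals)
mean = sum_x / N

print(N, sum_x, round(mean, 3))
```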

The reader who has taken the trouble to verify the calculation of the 
mean in Table 7, if he did not use a machine for multiplications, will 
have been impressed by the onerous mental arithmetic imposed by the 
five-digit numbers for the class centers. It is the particular function of 
coding to reduce this labor to a minimum through replacement of the 
existing set of class center or x values by a simpler set. In practice,
coding is a most direct and elementary procedure, used in one form or 
another on occasion by all. It would seem advantageous first of all to 
trace the underlying steps with respect to a very simple illustration and 
then give full generality to the principle in algebraic form. 


TABLE 7

A Frequency Distribution of Protein Analyses, with Calculations for the
Mean Value

(Original determinations to the nearest 0.05 per cent protein)

      (1)             (2)           (3)            (4)
 Apparent class    Frequency    Class center   Class total
     range             f             x              fx

  10.0-10.45            4         10.225           40.900
  10.5-10.95           18         10.725          193.050
  11.0-11.45           65         11.225          729.625
  11.5-11.95          157         11.725        1,840.825
  12.0-12.45          270         12.225        3,300.750
  12.5-12.95          327         12.725        4,161.075
  13.0-13.45          334         13.225        4,417.150
  13.5-13.95          276         13.725        3,788.100
  14.0-14.45          212         14.225        3,015.700
  14.5-14.95          141         14.725        2,076.225
  15.0-15.45           79         15.225        1,202.775
  15.5-15.95           31         15.725          487.475
  16.0-16.45           14         16.225          227.150
  16.5-16.95            7         16.725          117.075
  17.0-17.45            2         17.225           34.450

                    1,937                      25,632.325

Σx = 25,632.325.     N = 1,937.     x̄ = 13.233.




Probably any science student asked to give the average of the three 
numbers 19.01, 19.06, and 19.08 would immediately recognize that, 
since 19.0 is common to all, the average must be 19.0 followed by the 
average of 1, 6, and 8. He would be coding the data in so doing, finding 
the average on a new scale, then decoding his answer to get 19.05. Let 
us follow the steps in detail. 






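\[
\begin{aligned}
\frac{19.01 + 19.06 + 19.08}{3}
  &= \frac{19 + 0.01 + 19 + 0.06 + 19 + 0.08}{3} \\
  &= \frac{3(19)}{3} + \frac{0.01 + 0.06 + 0.08}{3} \\
  &= 19 + \frac{0.01(1 + 6 + 8)}{3} \\
  &= 19 + 0.01\left(\tfrac{15}{3}\right) \\
  &= 19 + 0.01(5) \\
  &= 19.05.
\end{aligned}
\]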

“But,” he might protest, “I didn't go through all that manipulation!”
Perhaps what he should have said was that he did not realize that his 
short cuts were based on just that type of manipulation. 
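
The steps traced above may be sketched in a few lines. Decimal arithmetic is used so that the hundredths are exact; the names `origin` and `unit` are illustrative assumptions standing for the common part 19 and the step 0.01.

```python
from decimal import Decimal

values = [Decimal("19.01"), Decimal("19.06"), Decimal("19.08")]
origin = Decimal("19")     # part common to all three numbers
unit = Decimal("0.01")     # size of one step on the simpler scale

# Coding: subtract the common part, then express the remainder in steps.
coded = [(x - origin) / unit for x in values]        # 1, 6, 8
coded_mean = sum(coded) / len(coded)                 # 5

# Decoding: return the coded mean to the original scale.
decoded_mean = origin + unit * coded_mean

direct_mean = sum(values) / len(values)
print(coded_mean, decoded_mean, direct_mean)
```

The decoded mean agrees with the mean computed directly from the original numbers.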

Let us now generalize these steps. Let x designate the original scale
values, a be the constant 19, and b the constant 0.01. The numbers
actually averaged (1, 6, and 8) were derived from the x values by sub-
tracting a from each, then dividing the remainders by b. Designating
the resulting code values by X, we have

\[ X = \frac{x - a}{b}, \qquad \text{or} \qquad x = a + bX. \]

Now

\[
\begin{aligned}
\bar{x} &= \frac{\sum x}{N} \\
        &= \frac{\sum (a + bX)}{N} \\
        &= \frac{\sum a}{N} + \frac{\sum bX}{N} \\
        &= a + b\bar{X}.
\end{aligned}
\tag{3}
\]

Therefore the mean value on the original scale of measurement may be
found by securing the mean value in terms of a code scale and then
decoding.

The reader who does not recall the algebraic rules of rearrangement
about the summation sign, Σ, may easily verify each step of this deriva-
tion by replacing Σx in the first line by its equivalent full form,
x_1 + x_2 + ... + x_N. He will in this manner establish the following
rules for himself:

(1) When summation of an expression involving terms joined by
plus or minus signs is called for, the summation may be made by
individual terms.

\[ \sum (a + bx - y) = \sum a + \sum bx - \sum y. \]

(2) Summation of a constant is equal to N times that constant,
if the summation is over N elements.

\[ \sum a = Na. \]

(3) Summation of the product of a constant and a variable is
equal to the product of the constant and the summation of the
variable.

\[ \sum bx = b \sum x. \]

Note also:

(4) Summation of the product of two variable magnitudes cannot
be rearranged.
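
These rules may be checked numerically. The sketch below uses small made-up series x and y and arbitrary constants a and b; all of the numbers are illustrative assumptions chosen only to make the arithmetic easy to follow.

```python
# Two short illustrative series and two constants (arbitrary values).
x = [3, 5, 8]
y = [2, 7, 1]
a, b = 4, 2
N = len(x)

sum_a = sum(a for _ in x)          # the constant summed over N elements
sum_bx = sum(b * xi for xi in x)

# Rule (1): a sum of terms may be summed term by term.
rule_1 = sum(a + b * xi - yi for xi, yi in zip(x, y)) == sum_a + sum_bx - sum(y)

# Rule (2): summing a constant gives N times the constant.
rule_2 = sum_a == N * a

# Rule (3): a constant factor may be taken outside the summation.
rule_3 = sum_bx == b * sum(x)

# Rule (4): the sum of products of two variables is NOT the product of sums.
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
products_differ = sum_xy != sum(x) * sum(y)

print(rule_1, rule_2, rule_3, products_differ)
```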


We may now return to the frequency distribution of the protein 
determinations, reproduced anew in the first two columns of Table 8. 
Can we reach a simple scale to replace the awkward x values by coding 
according to the foregoing scheme? Very simply! First take one of 
the values of x and use it as the constant a. Two such selections are 
used in the table: (1) a_1 is taken as the lowest value of x; (2) a_2 is taken
as a value near the center of the series, usually the modal class value.
The respective series of residuals, x − a, are given as columns (3) and
(5). Now take as the constant b the interval on the residual scale, which
is of course the true class range on the original scale. Dividing the
residuals by b in each series gives the code scales X_1 and X_2 of columns
(4) and (6). The contrast of the simplicity of these code scales with
the original class centers surely needs no emphasis.

Statistical organization of the data has made possible the expression 
of magnitude for the variable in the simplest possible form for computa- 
tional purposes. The new scales, although arbitrarily determined, do 
not lack descriptive value in the least. For instance, 8 on scale X_1
means 8 steps of 0.5 above the value 10.225, and −3 on scale X_2 means 3
steps of 0.5 below the value 13.225.

The mean value of the series on each new scale may now be secured 
with easy mental arithmetic. Proceeding as on Table 7, the sums of
variates within classes, fX_1 and fX_2, are given in columns (7) and (8) of
Table 8. Dividing the grand totals in each case by the total number of
cases (1,937) gives X̄_1 = 6.016, and X̄_2 = 0.016.

TABLE 8

Illustration of the Principle of Coding by Linear Transformation of
Scale for a Frequency Distribution of Uniform Class Range
(Data of Table 7)

                                 Coding schemes
                         (1)                  (2)
  Class     Fre-     a_1 = 10.225         a_2 = 13.225         Class totals
 centers   quency      b = 0.5              b = 0.5
    x         f     x − a_1    X_1       x − a_2    X_2       fX_1      fX_2
   (1)       (2)      (3)      (4)         (5)      (6)        (7)       (8)

 10.225        4      0.0       0         −3.0      −6           0       −24
 10.725       18      0.5       1         −2.5      −5          18       −90
 11.225       65      1.0       2         −2.0      −4         130      −260
 11.725      157      1.5       3         −1.5      −3         471      −471
 12.225      270      2.0       4         −1.0      −2       1,080      −540
 12.725      327      2.5       5         −0.5      −1       1,635      −327
                                                                      (−1,712)
 13.225      334      3.0       6          0.0       0       2,004         0
 13.725      276      3.5       7          0.5       1       1,932       276
 14.225      212      4.0       8          1.0       2       1,696       424
 14.725      141      4.5       9          1.5       3       1,269       423
 15.225       79      5.0      10          2.0       4         790       316
 15.725       31      5.5      11          2.5       5         341       155
 16.225       14      6.0      12          3.0       6         168        84
 16.725        7      6.5      13          3.5       7          91        49
 17.225        2      7.0      14          4.0       8          28        16
                                                                      (+1,743)

 Totals    1,937                                            11,653       +31

 Means                                                       6.016     0.016


According to these values the mean is 6.016 steps of 0.5 above 10.225,
or 0.016 step of 0.5 above 13.225. In algebraic form,

\[ \bar{x} = a + b\bar{X}. \]

That is,

\[ \bar{x} = 10.225 + 0.5(6.016) = 13.233; \]

or again,

\[ \bar{x} = 13.225 + 0.5(0.016) = 13.233. \]
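
Both coding schemes of Table 8 may be reproduced with the sketch below. Exact fractions are used so that the two decoded means can be compared with the uncoded mean without rounding error; the function and variable names are illustrative assumptions.

```python
from fractions import Fraction

# Class centers and frequencies of the protein series (Tables 7 and 8).
centers = [Fraction("10.225") + Fraction(1, 2) * i for i in range(15)]
freqs = [4, 18, 65, 157, 270, 327, 334, 276, 212, 141, 79, 31, 14, 7, 2]
N = sum(freqs)


def coded_mean(a, b):
    """Return (sum of f*X, mean of X, decoded mean) for origin a and step b."""
    codes = [(x - a) / b for x in centers]            # X = (x - a) / b
    total = sum(f * X for f, X in zip(freqs, codes))  # sum of fX
    mean_code = total / N
    return total, mean_code, a + b * mean_code        # decode: a + b * mean


b = Fraction(1, 2)
total_1, mean_1, decoded_1 = coded_mean(Fraction("10.225"), b)   # scheme (1)
total_2, mean_2, decoded_2 = coded_mean(Fraction("13.225"), b)   # scheme (2)

# Mean computed directly on the original scale, for comparison.
direct = sum(f * x for f, x in zip(freqs, centers)) / N

print(total_1, total_2)
print(round(float(mean_1), 3), round(float(mean_2), 3))
print(round(float(decoded_1), 3), round(float(decoded_2), 3))
```

Whatever origin is chosen, the decoded mean is identical with the mean found on the original scale; only the labor of the intermediate arithmetic differs.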
As an exercise the reader may calculate x̄ for the finger-length series
given in Table 3,* first using the true scale without coding, then using one
or both of the coding schemes just given. If Table 2 is now referred to,†
it will be observed that subtraction of a constant a is all that is necessary
to yield a desired code scale in that case. This amounts to the constant
b being taken as unity, so that x is equal to a plus X. In Table 1‡ the
original scale is itself of the simple type; any coding, if undertaken,
would be confined to taking a equal to 7, say, so that each multiplication
would involve a one-digit figure.

The reader will quickly perceive that the columns (3) and (5) of
Table 8 are quite unnecessary in routine coding of the x scale in a fre-
quency distribution table, provided that the seriation is uniform.
Whichever type of code scale is desired may be written in directly
and the values then recorded for the constants a and b involved. Thus,
a is the value on the original scale corresponding to zero on the code
scale, and b is the true class range (or interval between class centers) on
the original scale.

The advantage of taking a near the middle of the series of class
centers, giving the smallest values possible on the code scale, is offset
to some workers by the resulting introduction of minus signs into the
computations. When the number of classes is much in excess of 12,
however, a code scale with zero toward the center may facilitate mental
arithmetic quite considerably. Some workers using code scales with
negative values choose to make a guess at the mean on the original scale,
then take a as the x value immediately below this. If the guess has been
good, X̄ will be a very small but positive value. Should X̄ prove to be
negative in any case, however, it must be remembered that its sign is
essential to it in any further calculation. Thus,

\[ a + b(-\bar{X}) = a - b\bar{X}. \]

A seriation involving irregular grouping often makes full coding
according to the above scheme of very little if any value. However, the
first step alone of subtracting a suitable value a may help computations
considerably. This abbreviated coding is often used when seriation of
data is not worth while and a constant may be subtracted readily by
mental arithmetic as each variate magnitude is handled. In such cases,
a is best taken as a round number somewhat less than the lowest variate
so that mental subtractions will give immediately obvious residuals.


* Vide page 19.

† Vide page 17.

‡ Vide page 16.



CHAPTER 4 


THE MEASUREMENT OF VARIATION 

Francis Galton wrote, in his classical book “Natural Inheritance”:

It is difficult to understand why statisticians commonly limit 
their inquiries to Averages, and do not revel in more comprehensive 
views. Their souls seem as dull to the charm of variety as that of
the native of one of our flat English counties, whose retrospect of 
Switzerland was that, if its mountains could be thrown into its lakes, 
two nuisances would be got rid of at once. An Average is but a 
solitary fact, whereas if a single other fact be added to it, an entire 
Normal Scheme, which nearly corresponds to the observed one, 
starts potentially into existence. 

One would have little ground for castigating statisticians today in 
the words used by Galton nearly 50 years ago, and yet his eloquent por- 
trayal of the loss sustained by failure to describe variation has imparted 
lasting quality to that once-merited criticism. To attempt to describe a 
series of magnitudes solely in terms of some typical value and thus sacri- 
fice all knowledge of their variability is tantamount to obliteration of a 
most crucial feature of the character being measured. As a basis of 
generalization in science, each sample statistic such as a mean has its 
degree of trustworthiness which depends most fundamentally on the 
amount of variation present in the recorded values from which it is com-
puted. How enriched an average becomes when a measure of its sam-
pling stability is added to it! The biologist should certainly not fail to 
measure the variation he encounters in his quantitative observations. 
Thus may he learn to defend himself against both the embarrassments 
of overhasty inference and the losses of overwrought caution. 

RANGE 

Reference to the amount of variation shown by the graph of a fre- 
quency distribution would incite contemplation of the “amount of 
spread” of the distribution as it may be read in terms of the base scale. 
The range of variation, or difference between the minimum and maximum 
values in any seriation, might well be offered as a first measure of varia- 
tion. In terms of a histogram this is a finite measure, but if attention is 
given to frequency curves the appeal of the range fades away like the 
frequencies at each end of the distribution. In the supply portrayed by
the curve, just where does frequency stop? In most cases there is noth- 
ing harder to determine with real satisfaction than the answer to this 
question. 

The range of any finite sample is of course quite definite. But what 
is one to do about the outlying case which appears so commonly to 
plague the investigator with conjecture as to its “really belonging” with 
the rest of the individuals in the sample? Exclusion of that one measure- 
ment may change the range markedly. The importance which the 
extreme individuals assume in this portrayal of variation is overwhelm- 
ing. Is it of no consequence at all where the intermediate variates fall? 
One faces many such disturbing problems in careful analysis of the 
descriptive quality of range as a measure of variation. The need for a 
fundamentally sounder approach to the general problem soon becomes 
manifest. 

INDIVIDUAL DIFFERENCES 

Variations are observed through recognition of differences between 
individuals. Description of variation may on occasion be undertaken 
in that way. Variation in the characteristics of the members of a fam- 
ily may well be, indeed generally is, measured by direct comparison of 
each individual with every other. The justification of this method is 
due chiefly to two factors: first, a personal interest existing in each 
member as an entity; and second, most families are of distinctly limited 
number and the comparisons do not become onerous. The existence of 
such factors as these represents a very special situation when one turns to 
study of variation systems in general. 

Algebraic analysis readily shows that among N individuals the number
of different comparisons possible is equal to (N/2)(N − 1). When there
are but 20 variates, 190 individual differences appear, and 100 variates
would yield 4,950 comparisons. The difference between the minimum 
and maximum variates — the range — is of course just one of the set in 
each case. To attempt summary of individual differences by consider- 
ing all possible pairs is obviously out of the question. 
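
The growth in the number of comparisons is easily verified. The sketch below computes (N/2)(N − 1) and, as a check, enumerates the pairs explicitly for the smaller case; the function name is an illustrative assumption.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of distinct pairs among n individuals: (n/2)(n - 1)."""
    return n * (n - 1) // 2


# Explicit enumeration of every pair among 20 individuals, as a check.
enumerated_20 = len(list(combinations(range(20), 2)))

print(pair_count(20), pair_count(100), enumerated_20)
```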

It may well have occurred already to the reader that the set of 
N − 1 deviations of one selected variate from all others in any series
must give all the information incorporated in the much larger set of
(N/2)(N − 1) individual differences. The scientist is not interested in the
latter values as such, but in the amount of variation characterizing the
N variates. But if the deviations from some one variate are taken,
which one would be most appropriately chosen as this reference standard?
This is a troublesome question. No one variate is clearly the
best. Indeed, are not all equal in importance? 

After all, the value to be chosen for this attack on measuring varia- 
tion is to be a standard of reference. Need it be one of the N individuals 
of the series? What could provide a better standard than a value repre-
sentative of all, for instance, one of the typical values discussed in the
preceding chapter? What would be more logical than that, having
established a typical magnitude, one should turn attention to variation 
in terms of deviations of the individuals from that magnitude? Con- 
sideration of the suitability of the three typical values for the purpose 
quickly eliminates the modal value, leaving the issue of choosing between 
the median and the mean. Of these, only the mean is fully representa- 
tive of all variate magnitudes, for the median is just central with respect 
to total frequency. One may thus be guided to the mean as the stand- 
ard of reference. 


DEVIATION FROM TYPE 

Just as the variates, x_1, x_2, x_3, ..., x_N, formed the basis for the
measurement of type, so the quantities

\[ (x_1 - \bar{x}),\ (x_2 - \bar{x}),\ (x_3 - \bar{x}),\ \ldots,\ (x_N - \bar{x}) \]

provide a starting point for the measurement of variation or deviation 
from type. For brevity, one may designate the deviation of each 
variate from the mean of its series as a deviate. The task is to summarize 
these deviates into a single value that has, as far as possible, the qualities 
stipulated in the previous chapter for a good descriptive statistic. If
these conditions are adequately fulfilled it may serve as our description 
of the amount of variation in any series. 

There is no simpler form of summary than that provided by determin- 
ing the average value. Would the average of the deviates conform with 
requirements for the summary of variation? Unfortunately the answer 
must be in the negative. Since the mean is the “balancing point” of the
variates, the sum of the positive deviations from the mean must equal
in magnitude that of the negative deviations, regardless of the amount
of variation present in the series. Algebraically,

\[ \frac{\sum (x - \bar{x})}{N} = \frac{\sum x}{N} - \bar{x} = 0. \]
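
A brief numerical check of this balancing property follows. The four variate values are illustrative assumptions; exact fractions are used so that the sum comes out to precisely zero.

```python
from fractions import Fraction

values = [Fraction(v) for v in (4, 7, 9, 12)]   # illustrative variates
mean = sum(values) / len(values)

deviates = [x - mean for x in values]
positive_sum = sum(d for d in deviates if d > 0)
negative_sum = sum(d for d in deviates if d < 0)
average_deviate = sum(deviates) / len(deviates)

print(mean, positive_sum, negative_sum, average_deviate)
```

However widely the values are spread, the positive and negative sums cancel, so the plain average of the deviates says nothing about the amount of variation.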

This difficulty of positive and negative deviations which just balance 
one another in the sum may be removed in either of two simple ways. 
First, the deviations may be considered without regard to their signs;
all deviates may arbitrarily be considered as positive. Second, the 
deviations may be squared, or indeed raised to any even power, to give 
quantities of positive sign throughout. 

The average of the deviates taken without regard to their signs has 
been used somewhat as a measure of variation. It is often called the 
average deviation. This quantity is very easy of comprehension and 
would appear to summarize the deviations satisfactorily. However, it 
is not widely useful as a measure of variation in more advanced statistical 
work. This is due to the great difficulty in solving algebraic equations 
involving quantities whose signs are to be ignored. Knowledge concern- 
ing such procedures is comparatively scarce, and the statistical worker 
would seriously limit his mathematical field, cutting himself adrift at 
once from a vast amount of that very logic which it is his desire to turn to 
his advantage, if he arbitrarily ruled all deviates as positive quantities. 
Consideration of the positive and negative deviations in two separated 
groups would also be unsatisfactory, since such a procedure would entail 
the use of two magnitudes to express the one idea of amount of varia- 
tion, and would involve any further reasoning based on that idea with 
considerable complexity. 

It is a derived property of numbers that the product of two negative 
quantities must itself be a positive number. All have learned the ele- 
mentary rule of algebra flowing from this, that the square of any number 
is always positive, regardless of the sign of that number. The expedient 
of ridding the problem of the measurement of variation from a vexatious 
difficulty imposed by the signs of the deviates, through using the squared 
values rather than the absolute deviations, is very appealing in its 
mathematical simplicity. Very early in the quantitative measurement 
of variation this expedient was resorted to, and the results have proved 
so useful that the general statistical practice today is to measure varia- 
tion as a mathematical function of the squared deviates, (x − x̄)².
Summing these quantities and dividing by N, one will secure an average
value typifying the squared deviations. Symbolizing this average by
m_2, for it is commonly known as the second-moment coefficient, the
equation for the mean squared deviate becomes

\[ m_2 = \frac{\sum (x - \bar{x})^2}{N}. \tag{1} \]


In the complete absence of variation, m_2 must obviously be zero.
When only a little variation is present, the values of (x − x̄)² are individ-
ually small and therefore their average is small. As the values of x
differ increasingly among themselves, m_2 must rise rapidly in magnitude.
Also, the mean squared deviate is fully representative of all members of 
the series in that it takes into account every variate in the same manner. 

As a numerical quantity this mean squared deviate seems to fulfill 
those conditions which are logically to be imposed for the measurement 
of deviation from type. However, variation is in our everyday thinking 
a matter of simple differences, not of squared differences. Squaring 
was resorted to for the purely mathematical reason of eliminating signs. 
The immediate objective being accomplished, one may now reverse the 
move by extracting the square root of m_2 so as to return to the original
scale of measurement. These adjustments of squaring the deviates, 
then taking the square root of their mean, comprise transformations to 
and from a scale of squares made for convenience in hurdling an obstacle. 
The resultant quantity is a magnitude on the original scale of measure- 
ment and one directly proportional to the amount of variation present. 
It was called the standard deviation by Karl Pearson in 1894, and indeed 
it has since then become in fact the standard measure of variation. 
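
The two steps, squaring the deviates and then extracting the root of their mean, may be sketched as follows. The function names and the eight sample values are illustrative assumptions; the division is by N, as in equation (1).

```python
import math


def second_moment(values):
    """Mean squared deviate m_2: the average of (x - mean)^2 over N values."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def standard_deviation(values):
    """Square root of m_2, returning to the original scale of measurement."""
    return math.sqrt(second_moment(values))


values = [2, 4, 4, 4, 5, 5, 7, 9]     # illustrative series with mean 5
no_variation = [3, 3, 3]              # a series without any variation

print(second_moment(values), standard_deviation(values))
print(second_moment(no_variation))
```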

A statistic of tremendous utility, the standard deviation lends itself 
admirably to use in advanced statistical work. Every effort given 
immediately to comprehend its nature and meaning as a measure of 
variation will be well repaid. Symbolizing it by s, the fundamental 
definitive equation may be written in the form

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}. \tag{2} \]



CALCULATION PROCEDURE FOR THE STANDARD DEVIATION 
The Transfer from Deviates to Variates 

The basic definition of the standard deviation is shown at a glance 
by equation (2) above. This formula is ill suited for computational 
purposes, however. The deviations from the mean when numerically 
expressed are generally cumbersome, and squaring them is an onerous 
procedure providing considerable opportunity for mistakes. Harris 
pointed out in 1910 that a simple transfer from deviates to variates very 
greatly reduces the work involved. The rearrangement may be shown 
algebraically as follows: 

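\[
\begin{aligned}
s^2 &= \frac{\sum (x - \bar{x})^2}{N} \\
    &= \frac{\sum (x^2 - 2x\bar{x} + \bar{x}^2)}{N} \\
    &= \frac{\sum x^2}{N} - 2\bar{x}\,\frac{\sum x}{N} + \frac{N\bar{x}^2}{N}.
\end{aligned}
\]

But

\[ \frac{\sum x}{N} = \bar{x}. \]

Therefore

\[ s^2 = \frac{\sum x^2}{N} - \bar{x}^2, \tag{3} \]

and

\[ s = \sqrt{\frac{\sum x^2}{N} - \bar{x}^2}. \tag{4} \]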

Thus the standard deviation of a series of magnitudes may be derived 
expeditiously from the means of the first and second powers of the vari- 
ates; the individual deviates need not be computed. 
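
The equivalence of the two forms may be confirmed on a small series. In the sketch below the sample values and the variable names are illustrative assumptions, and exact fractions are used so that the two results can be compared for strict equality.

```python
from fractions import Fraction

values = [Fraction(v) for v in (2, 4, 4, 4, 5, 5, 7, 9)]   # illustrative
N = len(values)
mean = sum(values) / N

# Definition: average of the squared deviates from the mean.
from_deviates = sum((x - mean) ** 2 for x in values) / N

# Transfer to variates: mean of the squares minus the square of the mean.
mean_square = sum(x ** 2 for x in values) / N
from_variates = mean_square - mean ** 2

print(from_deviates, from_variates)
```

Only the sums Σx and Σx² are needed for the second form, which is why it suits accumulation on a calculating machine.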

When a calculating machine is available and N is not too large (not
in excess of 100, say), Σx and Σx² may be accumulated directly on the
machine without previous seriation of the data. The mean and mean
square may then be determined, the square root of the difference between
the mean square and the square of the mean giving the standard
deviation. Applying this procedure to the
erythrocyte counts given by Haden for 40 normal men,^ the following
record sheet of computations was secured.


x = red blood cell count                Series: Haden (1)

N    = 40                               x̄      = 4.97275
Σx   = 198.91                           Σx²/N  = 24.838283
Σx²  = 993.5313
s²   = 0.110040                         s      = 0.331723
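
The entries of this record sheet follow from the three totals alone. A minimal sketch, taking N, Σx, and Σx² as given above and using ordinary floating-point arithmetic (so the last digits are subject to rounding); the variable names are illustrative assumptions.

```python
import math

# Totals taken from the record sheet for the Haden series.
N = 40
sum_x = 198.91
sum_x2 = 993.5313

mean = sum_x / N                     # mean of the variates
mean_square = sum_x2 / N             # mean of the squared variates
s_squared = mean_square - mean ** 2  # equation (3)
s = math.sqrt(s_squared)             # equation (4)

print(round(mean, 5), round(mean_square, 6))
print(round(s_squared, 6), round(s, 6))
```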


General Summations from Frequency Tables 

It seems desirable to emphasize at this point that, in our algebraic 
notation, Σ has been defined as indicating the sum of all the values
directly following it. The symbol for the variable, for instance x, when 
placed after a summation sign is understood to indicate the individual 
values of x, the sum being made for all variate magnitudes. 

^ Vide page 38.

The class centers in a frequency distribution really correspond to the
variate magnitudes under an assumption that measurement was remade
to the nearest value according to the scale of graduations given by those
class centers. In the tables to be given immediately following in this
chapter, these centers will be designated as x_c solely for purposes of dis-
tinction. This permits more convenient demonstration of the point
that general summations made from frequency tables proceed by two
steps. In securing Σx to calculate the mean, for instance, one first
multiplies the class center by the frequency to get the sum of all x values
in the class. Then one totals these products. Forgetting that mul-
tiplication is simply repeated addition of the same quantity, the reader
might easily overlook the preliminary multiplication as being part of the
general summation. If there are k classes in a frequency table, then
Σx as applied to it is identical with Σfx_c, the latter sum being taken over
the k classes. We shall see that not only the
first and second but also higher powers of x enter statistical computa-
tions, and summation of them called for by the general algebraic nota-
tion Σx^i reduces to Σfx_c^i in the special frequency-table notation. The gen-
eral rule may be stated as follows: If data are seriated into a frequency
table and a sum over all N values for the variable is called for, each class
center must be counted as many times as is indicated by the frequency of
the class.

The Uncoded Frequency Table 

Table 9 with its appended summary of derived values indicates the
steps involved in securing the mean and standard deviation for data
assembled in the form of a frequency table. The material is that already
presented in histogram form as Fig. 8,² the seriation being that of panel
(D). The first four columns of the table represent familiar steps. The
last column, headed $f x_c^2$, is assembled by securing the products of the
corresponding values in the $x_c$ and $f x_c$ columns, its sum being $\Sigma x^2$ of the
general algebraic notation. With a calculating machine there would
not be any need to include this last column in the table as the products
could be accumulated in a sum directly on the machine as each is secured.
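The two-step summation for a frequency table may be sketched as follows, using the class centers and frequencies of Table 9. The variable names are illustrative only. The `expanded` list counts each class center as many times as its frequency indicates, to show that the weighted sum and the plain sum over all N values agree.

```python
import math

# Class centers (pounds) and frequencies as given in Table 9.
centers = [5.25 + 0.5 * i for i in range(12)]
freqs = [7, 28, 56, 72, 86, 65, 42, 22, 13, 7, 3, 1]

n = sum(freqs)
sum_x = sum(f * x for f, x in zip(freqs, centers))        # sum of f * x_c
sum_x2 = sum(f * x * x for f, x in zip(freqs, centers))   # sum of f * x_c^2

# The same first-power sum taken over every individual value.
expanded = [x for f, x in zip(freqs, centers) for _ in range(f)]

mean = sum_x / n
s_x = math.sqrt(sum_x2 / n - mean ** 2)
print(n, sum_x, sum_x2, round(mean, 6), round(s_x, 6))
```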

Code Scales 

Uniform grouping in a frequency table permits one to use code
scales to speed up calculations. The standard deviation may be secured
along with the mean on a coded scale, here written $x'$, the results then
being transformed to the original scale. The transformation equation for
converting $s_{x'}$ to $s_x$ may be derived very simply as follows:

Since $x = a + bx'$, and $\bar{x} = a + b\bar{x}'$,

then $x - \bar{x} = b(x' - \bar{x}')$.

² Vide page 31.

TABLE 9

Calculation of Mean and Standard Deviation for Weight (in Pounds) at
Birth of 402 Infants

Class range      x_c       f       f x_c        f x_c^2
 5.0- 5.49       5.25       7      36.75       192.9375
 5.5- 5.99       5.75      28     161.00       925.7500
 6.0- 6.49       6.25      56     350.00     2,187.5000
 6.5- 6.99       6.75      72     486.00     3,280.5000
 7.0- 7.49       7.25      86     623.50     4,520.3750
 7.5- 7.99       7.75      65     503.75     3,904.0625
 8.0- 8.49       8.25      42     346.50     2,858.6250
 8.5- 8.99       8.75      22     192.50     1,684.3750
 9.0- 9.49       9.25      13     120.25     1,112.3125
 9.5- 9.99       9.75       7      68.25       665.4375
10.0-10.49      10.25       3      30.75       315.1875
10.5-10.99      10.75       1      10.75       115.5625
Totals                    402   2,930.00    21,762.6250

$N = 402$.  $\Sigma x = 2930.00$.  $\bar{x} = 7.288557$.
$\Sigma x^2 = 21762.6250$.  $\Sigma x^2 / N = 54.135883$.
$s_x^2 = 1.012820$.  $s_x = 1.006390$.


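Now, with $x'$ denoting the coded variate,

\[ \frac{\Sigma (x - \bar{x})^2}{N} = b^2\,\frac{\Sigma (x' - \bar{x}')^2}{N}. \]

Therefore,

\[ s_x^2 = b^2 s_{x'}^2 \quad \text{and} \quad s_x = b\,s_{x'}. \qquad (5) \]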


The coding constant a does not enter this transformation equation. 
Subtraction of a constant quantity from a set of variates will leave the 
residuals with the same variability as characterized the original meas- 
urements. 

The full calculations for the mean and standard deviation of the series
used in Table 9 are given through use of coding in Table 10. The
simplification of the calculations by transformation of scale may again be
attested by the ease with which the work in Table 10 may be verified by
mental arithmetic. The final results are of course identical (within
last-place modification errors) with those secured in Table 9.

TABLE 10

Calculations as for Table 9 but Using Coding

  x_c       f      x'     f x'     f x'^2
  5.25      7       0        0          0
  5.75     28       1       28         28
  6.25     56       2      112        224
  6.75     72       3      216        648
  7.25     86       4      344      1,376
  7.75     65       5      325      1,625
  8.25     42       6      252      1,512
  8.75     22       7      154      1,078
  9.25     13       8      104        832
  9.75      7       9       63        567
 10.25      3      10       30        300
 10.75      1      11       11        121
Totals    402            1,639      8,311

$N = 402$.  $a = 5.25$.  $b = 0.5$.
$\Sigma x' = 1639$.  $\bar{x}' = 4.077114$.  $\bar{x} = a + b\bar{x}' = 7.288557$.
$\Sigma x'^2 = 8311$.  $\Sigma x'^2 / N = 20.674129$.
$s_{x'}^2 = 4.051270$.  $s_{x'} = 2.012777$.  $s_x = b\,s_{x'} = 1.006389$.
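The coded calculation of Table 10 can be sketched in Python as below. The code values 0 to 11 and the constants `a` and `b` are those of the table; the variable names are illustrative. Only the multiplier `b` enters the conversion of the standard deviation, as equation (5) requires.

```python
import math

freqs = [7, 28, 56, 72, 86, 65, 42, 22, 13, 7, 3, 1]
codes = list(range(12))     # coded class centers, x' = (x_c - a) / b
a, b = 5.25, 0.5            # coding constants of Table 10

n = sum(freqs)
sum_c = sum(f * c for f, c in zip(freqs, codes))         # sum of f * x'
sum_c2 = sum(f * c * c for f, c in zip(freqs, codes))    # sum of f * x'^2

mean_c = sum_c / n
s_c = math.sqrt(sum_c2 / n - mean_c ** 2)

mean_x = a + b * mean_c     # back to the original scale
s_x = b * s_c               # equation (5): a does not enter
print(sum_c, sum_c2, round(mean_x, 6), round(s_x, 6))
```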


“LEAST SQUARES”

In accepting the mean as the typical value about which to measure 
variation, and then establishing the standard deviation as an appropriate 
measure of the variation, we have associated two descriptive statistics 
that are held closely together by a natural mathematical bond which we 
are now in a position to reveal. Let us consider the change in the “mean 
square deviation” about any point a as the latter moves along the scale 
of measurement. Instead of subtracting $\bar{x}$ from every value of $x$, let
us subtract some other constant, $a$. Then the mean square deviation
about $a$ will be

\[ \text{M.S.D.} = \frac{\Sigma (x - a)^2}{N} = \frac{\Sigma (x^2 - 2ax + a^2)}{N} = \frac{\Sigma x^2}{N} - 2a\bar{x} + a^2 = s_x^2 + (\bar{x} - a)^2. \]


It is immaterial whether $a$ is greater or less than $\bar{x}$ as far as the sign of
$(\bar{x} - a)^2$ is concerned; this sign will always be positive. Therefore the
mean square deviation will always be greater than the squared standard
deviation unless $a$ is made equal to $\bar{x}$. From this it follows directly that
the mean of any series is the value from which the variates as a whole 
deviate the least, provided that variation is measured by squaring the 
deviations. The standard deviation measures the variation which has 
been so minimized. Mathematicians express this property of the mean 
by saying that it is the “least squares solution” of finding a value most 
closely approximating any series of magnitudes. 
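The least-squares property may be checked numerically. The sketch below uses a small hypothetical series (the five values are invented for illustration) and confirms that the mean square deviation about any point exceeds the squared standard deviation by the square of that point's distance from the mean.

```python
def mean_square_deviation(values, a):
    """Mean of the squared deviations of the values about the point a."""
    return sum((x - a) ** 2 for x in values) / len(values)

values = [2.0, 3.0, 5.0, 7.0, 8.0]           # hypothetical series
mean = sum(values) / len(values)
s2 = mean_square_deviation(values, mean)     # squared standard deviation

for a in (3.0, 5.0, 6.5):
    msd = mean_square_deviation(values, a)
    # The last two columns agree: M.S.D. = s^2 + (mean - a)^2
    print(a, msd, s2 + (mean - a) ** 2)
```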

This affinity of mean and standard deviation may naturally raise the 
question : If the deviations had been taken without regard to sign instead 
of squaring them, would the mean still remain as the point about which 
variation so measured would be a minimum? Let the mean deviation
be defined in general form as follows:

\[ \text{M.D.} = \frac{\Sigma |x - a|}{N}, \]

wherein the vertical rules indicate that the sign of the quantity enclosed
by them is to be ignored. For what value of a is this mean deviation a 
minimum? The complexity of the derivation necessitates that this 
question be answered here without proof. The answer is quite cer- 
tain, however, that the variates deviate least from their median value 
if first-power deviations without signs are used as the measure of varia- 
tion. It seems very reasonable then that the “average deviation” 
should be taken about the median rather than about the mean; the 
median and this mean deviation have the same affinity as the mean and 
standard deviation. 
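Although the proof is omitted, the statement is easily tried on figures. The following sketch takes a hypothetical series of five values (invented here, with one extreme variate) and searches a grid of trial origins for the one giving the smallest mean deviation; the function name and the grid spacing are illustrative choices.

```python
import statistics

def mean_deviation(values, a):
    """Mean of the deviations about a, taken without regard to sign."""
    return sum(abs(x - a) for x in values) / len(values)

values = [1.0, 2.0, 3.0, 4.0, 20.0]     # hypothetical series
mean = statistics.mean(values)
median = statistics.median(values)

md_about_mean = mean_deviation(values, mean)
md_about_median = mean_deviation(values, median)

# Trial origins 0.0, 0.1, ..., 20.0; keep the one with the least mean deviation.
candidates = [i / 10 for i in range(0, 201)]
best = min(candidates, key=lambda a: mean_deviation(values, a))
print(median, best, md_about_median, md_about_mean)
```

With an odd number of distinct variates the minimum is attained at the median alone; with an even number any point between the two middle variates serves equally well.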

In view of these facts and the probably greater simplicity of compre- 
hension of the median and mean deviation, it is very reasonable to ask 
why statisticians work so generally in terms of the mean and standard 
deviation. The explanation arises from precisely the same type of 
reasoning as led to the question. We are in search of simplicity. “Least
squares” solutions, or the minimizing of squared deviations to reach
representative values, may be made with facility, whereas the minimiz- 
ing of absolute deviations becomes so involved that the problems can 
rarely be solved that way. Now problems of defining a moving repre- 
sentative value prove multitudinous in statistical analysis when one 
passes to the mutual relationships of two or more variables, and therefore 
consideration of the simplicity with which solution of such problems 
in general may be reached assumes paramount importance. If there 
is no difficulty in accepting the mean as a representative value, then its 
logical associate in measuring variation should likewise be acceptable. 
The use of the standard deviation as a measure of variation may at first 
offer the student some tasks in comprehension, but it will pave the way 
to progress where the average deviation would lead into a blind alley. 


INDEPENDENT DEVIATIONS AND VARIANCE 

During our preliminary discussion of the variation shown by N
variates, attention was drawn to the point that fundamentally only $N - 1$
independent differences exist between variates. Of the $\frac{N}{2}(N - 1)$
possible individual differences, $N - 1$ arise through comparing a selected
variate with the others. The remainder arise from comparing the others
with one another, all these secondary differences being derivable from
what we may call the $N - 1$ primary deviations. Take a simple series
of 3 variates, $x_1$, $x_2$, and $x_3$. The primary differences may be given as
$x_2 - x_1$ and $x_3 - x_1$. The only possible secondary difference here is
$x_2 - x_3$, which is just $[(x_2 - x_1) - (x_3 - x_1)]$, or the difference between
the primary differences. This may be extended to a series of any size.
No information about variation is gained by considering the secondary
differences.

When deviations are taken from the mean instead of from some
selected variate, quite obviously N differences (in this particular case we have
called them deviates) arise. Has one additional element of information
about variation been gained? Certainly not! Each deviate is to some
extent smaller than it would be if we demanded independent deviations.
Each deviate is, to a small degree, a comparison of a variate with itself,
for $\bar{x}$ is the sum of the Nth parts of all variates. The extra deviate
simply makes up for this. Fundamentally there are only $N - 1$
independent deviations. One might therefore argue that the sum of their squares
should be divided by $N - 1$ instead of by $N$. This slight change in the
divisor may be considered as inconsequential unless N is small and one is
desirous of estimating the variation in the supply directly from the
sample. The adjustment plays a most important role in the
“small-sample” techniques, to the development of which Fisher and others have
contributed so notably. Detailed consideration of these techniques
seems to the writer to belong as a sequel to an introductory discussion of
statistical reasoning, and development of them will be reserved for
another volume. The reader, however, is now in a position to understand
a variant of the formula just given for the standard deviation. It is the
square root of the variance, $v$, where


\[ v_x = \frac{\Sigma (x - \bar{x})^2}{N - 1}. \qquad (6) \]


The square root of $v$ should not be called the standard deviation of the
sample, but rather the maximum likelihood estimate (M.L.E.) of the
standard deviation, $\sigma$, of the supply. It may be derived very simply
from $s_x$ by means of a correction factor.

\[ \text{M.L.E.}_{\sigma} = s_x \sqrt{\frac{N}{N - 1}}. \qquad (7) \]


The reader may derive this factor for himself very readily from equations 
(2) and (6). 
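The relation between the two divisors can be verified on a short series. In the sketch below the five values are hypothetical and the function names are illustrative; the first function uses the divisor N, the second the divisor N - 1 of equation (6), and the last line applies the correction factor of equation (7).

```python
import math

def sample_sd(values):
    """Standard deviation of the sample, divisor N."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def variance_estimate(values):
    """Variance with divisor N - 1, as in equation (6)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

values = [4.0, 7.0, 8.0, 9.0, 12.0]      # hypothetical series
n = len(values)
s_x = sample_sd(values)
v_x = variance_estimate(values)
corrected = s_x * math.sqrt(n / (n - 1))  # equation (7)
print(s_x, math.sqrt(v_x), corrected)
```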


THE QUARTILES 

Description of frequency distributions in terms of their “probability
points” has been initiated in the discussion of the median as a central
value. Its representativeness depends on its “middlemost” character.
There is a very simple extension of this idea that permits of the portrayal
of variation in the same terms.

Half of the variates in any series deviate in the positive sense from
the median value, the other half being negative deviations. One may
choose to establish the “middlemost deviation” in each direction. Such
values taken with the median itself clearly divide the whole ranked series
into 4 equal frequency groups. They are therefore called the quartile
values, the lower being known as the first quartile ($Q_1$), the median
forming the second, and the upper value the third quartile ($Q_3$). The value
given by one-half of the difference between the first and third quartiles is
used on occasion as a description of the variation present; it is usually
called the semi-interquartile range and is symbolized as $Q$. Then

\[ Q = \frac{Q_3 - Q_1}{2}. \qquad (8) \]

It is worthy of note that, like the median, $Q$ is not affected by extreme
variates. It is a measure of mediocrity in variation, and is based on
frequency of variates rather than individual deviation magnitudes.

When working from a frequency table rather than ranked variates, 
determination of the quartile values usually involves some calculation 
because the points must be assigned within class ranges. To clarify 
these matters an illustration of procedure is given in Table 11, utilizing 
the data on examination scores graphically portrayed in Fig. 2. The 
assumption underlying selection of an appropriate point within a class 
range is that the frequency of the class is uniformly distributed over the 
range. Attractive because of its simplicity, this assumption may be 
replaced, if desired, by others of more elaborate nature giving better 
approximation to frequency-curve form. We shall use the simple one 
herein. A “cumulative frequency” column is formed in which the fre- 
quency up to the boundary line between classes is given. The class 
range within which each quartile must fall then becomes apparent by 
inspection. That proportion of the scale covered by the class is taken 
which, under the assumption, yields the necessary frequency. Referring 
to Table 11, the first quartile will fall at the value which will cut off
the lower 112.25 out of 449 units of frequency. Sixty-three units exist
below a score of 65, and 122 below a score of 70. To 65 must be added
the proportion $\frac{112.25 - 63}{122 - 63}$ of the class range immediately above 65. The
evaluation of all quartile values is appended to the table. Figure 12
shows the quartiles plotted on the histogram, each shaded area
comprising one-quarter of the whole, the dividing ordinates rising at the quartile
values.
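The interpolation just described may be sketched as a small function. It assumes, as the text does, that the frequency of a class is spread uniformly over its range; the function and argument names are illustrative, and only the figures quoted above for the first quartile are used.

```python
def interpolate_point(target, lower_bound, cum_below, cum_through, width):
    """Scale value cutting off `target` units of cumulative frequency,
    assuming the class frequency is uniformly spread over the class range."""
    fraction = (target - cum_below) / (cum_through - cum_below)
    return lower_bound + fraction * width

n = 449
q1_target = n / 4                       # 112.25 units of frequency
# 63 units lie below 65 and 122 below 70, so the class range is 5.
q1 = interpolate_point(q1_target, lower_bound=65, cum_below=63,
                       cum_through=122, width=5)
print(round(q1, 2))
```

The same function serves for the median and third quartile, and for deciles or percentiles, once the cumulative frequencies at the boundaries of the appropriate class are supplied.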

FIGURE 12

Illustrating the Quartile Values in a Set of 449 Examination Scores*

* Data and calculations in Table 11, page 63.

TABLE 11

Cumulative Frequency and the Calculation of Quartiles

(Scores of 449 students in an examination.*)

* Data by courtesy of the Committee on Educational Research, University of Minnesota.
† Cumulative frequency up to boundary between classes.

This same principle may be extended to the fractionation of the total
frequency into any systematic scheme of parts desired. Among these,
deciles and percentiles are not infrequently used in discussions of
educational data. The reader may readily gain some experience with these
finer divisions and the problems they present by first applying the above
principles of procedure to the data of Table 11.

RELATIVE VARIATION 

For the comparison of the variabilities existent in different series, it 
is sometimes necessary to express variation on a relative scale. For 
example, the standard deviation of finger length (left middle finger) for 
the series of Fig. 1 is 0.5479 mm. For the same group of individuals, the 
standard deviations for head length and stature were found to be 6.046 
mm. and 2.5410 inches, respectively. Direct comparison of these stand- 
ard deviations is not very informative because of the difference in general 
level of magnitude of these dimensions. Moreover, one is expressed in
inches and the others in millimeters. If, however, the standard
deviations are expressed as percentages of the respective means, yielding 4.74
per cent, 3.15 per cent, and 3.88 per cent, respectively, it is seen
immediately that finger length is relatively the most variable, and head length
relatively the least variable, of the three. This relative measure of
deviation is known as the coefficient of variation:

\[ \text{C.V.} = 100\,\frac{s_x}{\bar{x}}. \qquad (9) \]
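Equation (9) is a one-line calculation. As a sketch, it is applied here to the mean and standard deviation already obtained for the birth weights of Table 9; the function name is an illustrative choice.

```python
def coefficient_of_variation(s_x, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100.0 * s_x / mean

# Summary values from Table 9 (weight at birth, in pounds).
cv = coefficient_of_variation(1.006390, 7.288557)
print(round(cv, 2))
```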

A word of warning is warranted with reference to the use of this 
measure. In its interpretation, caution must always be exercised not to 
neglect the fact that its magnitude is just as much a function of the mean 
as it is of the standard deviation. Variabilities measured on fully com- 
parable scales should be compared directly in terms of the standard 
deviations as far as is practical. If a relative measure is necessary in 
order to circumvent the use of different scales or to allow for incompat- 
ibility of means, care should be exercised to assure oneself that it is 
wholly logical to consider the variation in proportion to the mean value as 
a basis of comparison. The use of the coefficient of variation may give 
valuable information on questions such as the following: Is human 
stature more variable in the adult than in the new-born infant? Is the 
error of the chemical determination of ash in foods relatively greater 
than that for protein in the same materials? It would be quite foolish 
to compare the error of a measurement process with the variability of 
individuals so measured, by means of the coefficient of variation.

TYPE OF VARIATION

The foregoing discussion of variation has been strictly confined to the
question of amount of variation. Questions of skewness, kurtosis, et
cetera, have been intentionally divorced from the discussion, for they are
concerned with type of variation. In considerations of the amount of
variation, the signs of the differences between variates are irrelevant; one
is concerned solely with consistency of measurements, like the scattering
of bullet holes about their focal point on a target. We shall now pass to a
general scheme of measuring the characteristic features of frequency
distributions in which $\bar{x}$ and $s_x$ form the first elements, measures of
skewness and kurtosis following directly in the scheme.



CHAPTER 5 


MOMENTS AND DISTRIBUTION CHARACTERISTICS 

The student of mechanics very early becomes familiar with the
concept of “the moment of a force about a point,” or the importance of the
force in producing (or tending to produce) motion about an axis. In
games of “seesaw,” a child learning to balance another on a beam, which
itself is balanced about a “fulcrum,” discovers that the effectiveness of its
weight on the beam is directly proportional to its distance from the
fulcrum. A sketch of such a system in equilibrium is given below for
reference. When equilibrium is established it is found that the product
of a weight $A$ and its distance $\alpha$ from the fulcrum $F$ is equal to the similar
product for $B$. These products are termed “moments” in mechanics.
The two weights $A$ and $B$ exert their force in the same downward
direction. The distances $\alpha$ and $\beta$ are measured in
opposite senses and therefore are appropriately
given opposite signs. The two moments then
are $-A\alpha$ and $+B\beta$, the sum being zero when
equilibrium prevails.

The effectiveness of a weight on a horizontal beam¹ with respect to
its tendency to produce motion may similarly be measured with respect
to any point in the beam. The fulcrum may be moved to any point,
each such point of reference being known as an origin of the calculated
moment. If there are $k$ objects on a weightless rod which is scaled in
equal increments from some point (for example, one end, although that is
immaterial), then the total moment of the system about any fixed
origin $a$ in the beam is given by the expression $\Sigma f(x - a)$, where $f$ is each
weight, $x$ is its position point on the scale, and summation is made for all
$k$ products.

¹ The beam is theoretically assumed to be without weight of its own.

Let $x$ be the scale of a variable, of which a sample of total frequency
$N$ has been seriated into $k$ classes, $f$ designating the frequency of any
class. A histogram drawn for this seriation may be visualized as a series
of rectangles on a weightless beam, the weight of each rectangle being
proportional to its area, that is, the class frequency in each case. If $a$ is
the point in the rod about which the system would be in balance if a
fulcrum were placed there, then obviously $a$ lies somewhere between the
lowest class value, $x_1$, and the highest class value, $x_n$. The value of $a$
may be determined very simply, since the condition of equilibrium
demands that the total moment shall be zero. The moment of each
class is $f(x - a)$, and by definition

\[ \Sigma f(x - a) = 0. \]

Since the only variable quantities here are $f$ and $x$, the expression may be
expanded to the form

\[ \Sigma fx - a\Sigma f = 0. \]

But $\Sigma f = N$, and $a\Sigma f = Na$. Therefore $\Sigma fx = Na$, or

\[ a = \frac{\Sigma fx}{N} = \bar{x}. \]
Thus the mean of any series is the balancing point with respect to weight 
of the observations on the scale of measurement. 
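The balancing property may be illustrated with a small hypothetical system of weights on a scaled rod (the frequencies and positions below are invented for the purpose, and the function name is illustrative). The total first moment vanishes about the mean and about no other origin.

```python
def total_first_moment(freqs, positions, origin):
    """Sum of f * (x - origin) over all classes."""
    return sum(f * (x - origin) for f, x in zip(freqs, positions))

freqs = [1, 3, 4, 2]                  # hypothetical class frequencies (weights)
positions = [0.0, 1.0, 2.0, 3.0]      # hypothetical class values on the scale

balance = sum(f * x for f, x in zip(freqs, positions)) / sum(freqs)

print(balance)
print(total_first_moment(freqs, positions, balance))   # zero within rounding
print(total_first_moment(freqs, positions, 1.0))       # not zero elsewhere
```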

The general characters of frequency distributions which have already 
been noted, namely, 

(1) total frequency, 

(2) balancing point, 

(3) amount of variation, 

(4) skewness, and

(5) kurtosis, 

may all be defined very readily in terms of a simple series of extensions of 
this moment concept. These extensions emanate from the basic step 
of considering not only the distance factor, x — a, but also the series of 
powers from zero through the fourth of this factor. Let us analyze the 
expression for the mean squared deviate, from which our standard
measure of variation is derived. Using special frequency-table notation ($x_c$
for class centers),

\[ m_2 = \frac{\Sigma f(x_c - \bar{x})^2}{N}. \]

This is simply the mean second moment about the origin $\bar{x}$. The order
of the moment, designated by the qualifying adjective, is the power to
which the distance factor or deviation is raised. If each value of $x$ is
considered individually the equation reduces to the general algebraic
form

\[ m_2 = \frac{\Sigma (x - \bar{x})^2}{N}, \]

$f$ being reduced to the constant 1 and therefore omitted.

One may conveniently generalize the terminology for statistical
moments according to the following scheme:

(1) The nth moment of an individual $x$ about an origin $a$ on its
scale of measurement is $(x - a)^n$.

(2) The total nth moment of a series about the origin $a$ is
$\Sigma (x - a)^n$.

(3) The mean nth moment of a series of $N$ individuals about an
origin $a$ is $\frac{\Sigma (x - a)^n}{N}$.
The statistician in his studies of variables is concerned with average 
values derived from series of individuals, that is, with single figure repre- 
sentations of mass characteristics. The moment of an individual and 
the total moment of a series are but steps, as it were, to an average 
moment of some sort. Also he is very largely concerned for reasons of
convenience with two particular origins, the mean value of the series and
the scale origin of zero. In the theoretical development of statistical
concepts it is most convenient to refer measurements to the mean as
origin, and as a consequence a special designation in word and symbolism
has arisen for the average moments about the mean as origin. These
are known as the moment coefficients, and the symbols used for them are
$m$ with definitive subscripts. Hence the previous use of $m_2$, meaning
the second-moment coefficient, as the symbolic designation for the mean
squared deviate. We shall later use $m'$ as the symbol for a mean
moment taken about zero as origin. In accordance with this notation,

\[ m'_1 = \frac{\Sigma x}{N}, \qquad m'_2 = \frac{\Sigma x^2}{N}, \]

and so forth.

It is interesting, though quite incidental, to note that one total 
moment is used as such, namely, the total zeroth moment. This is the 
total frequency of any series, designated herein as N. It is a derived 
property of numbers that any quantity when raised to the zeroth power
is unity. Thus, $(x - a)^0$ equals unity, and therefore $\Sigma (x - a)^0$ equals
$N$; the frequency of a series is a total moment, and all mean moments
are therefore ratios of total moments.

One may hope that the utility of the lower moments for describing 
properties of frequency distributions will by now have appealed to the 
reader. The zeroth defines frequency, the first defines the most broadly 
useful measure of type, and the second provides the standard measure of 
variation. When it is realized that the characteristics of skewness and 
kurtosis in frequency distributions may likewise be defined very simply 
in terms of the third- and fourth-moment coefficients respectively, some 
appreciation of the power of analysis by means of moments will have 
been achieved. Before proceeding with demonstration of this, how- 
ever, perhaps one might be pardoned for a digression into the historical 
basis for the use of the term “moment.”

It was while incumbent of the professorship of applied mathematics
and mechanics at University College, London, with students of
engineering chiefly forming his classes, that Karl Pearson laid many of the
foundations of modern statistical analysis. Challenged by visions implanted
by Francis Galton's masterful book “Natural Inheritance” (1889), and
immensely stimulated by Weldon (professor of zoology at University
College) and Galton to blaze trails on the frontier of application of
mathematical reasoning to the resolution of biological problems,
Pearson faced the task of objectively defining more precisely and more
completely the characteristics of frequency distributions. In extension of
the mechanical concept of moments he perceived possibilities for the
development of powerful statistical tools for this purpose, possibilities
which he soon developed into realities. The zeroth, first, and second
moments were already in use though not recognized as such. Let us
return now to a consideration of the utility of the third and fourth 
moments. 


SKEWNESS 

In perfectly symmetrical distributions it is easy to perceive that all 
the odd-moment coefficients must be zero. The odd powers of $x - \bar{x}$
retain the sign of the deviate itself, and, each positive deviation being
exactly balanced in frequency by that of the deviation of opposite sign,
the sum must reduce to zero. Perfect symmetry in a random sample
from a symmetrical supply of variates would, of course, not be expected
except as a coincidence. However, the third-moment coefficient for
such random samples in general will tend toward zero and differ from
that value only within the relatively small margin of sampling errors.

On the other hand, if the distribution of frequency is skew this
condition will not apply. Equal magnitudes of positive and negative
deviates will have systematically differing frequencies, and the cubing
of the deviates, giving rapidly increasing importance in the sum to each
deviate as its magnitude increases, will reflect the skewness in the more
or less marked departure of the third-moment coefficient from zero.
The effect is that of making the “tail wag the dog.” The long tail of
the distribution will contribute more to the total than the deviates of
opposed sign can offset, and so the total third moment and $m_3$ will
assume the sign of the direction of the long tail. Also, the longer the
tail (relatively), or the greater the skewness, the larger will $m_3$ become.
The same feature will prevail with every higher odd moment; the higher
the moment the greater the magnitude of the sum or coefficient. It is
simpler, of course, to use the lowest power that will achieve the purpose
satisfactorily, and so $m_3$ forms the basis of measuring skewness.

It is perhaps helpful to consider here a concrete example of skewness.
The data in Table 12 represent an arbitrarily defined frequency
distribution of 100 individuals on a simple scale. It has been devised as a
perfectly symmetrical distribution. By shifting two individuals in this
distribution, one from class $x = 2$ into class $x = 0$, and one from class
$x = 3$ into class $x = 5$, perceptible skewness has been introduced
without altering the mean value, $\bar{x} = 4$. The frequencies and calculations
for the skew series are given in the columns headed “skew,” in
juxtaposition with those of the symmetrical distribution. The effect of the
shift of two individuals on $\Sigma (x - \bar{x})^3$ and $m_3$ may now be studied
numerically. For the symmetrical distribution, $\Sigma (x - \bar{x})^3$ and $m_3$ are
obviously zero. For the skew distribution, however, they are negative.
This is purely a result of raising the deviates to a higher odd power than
the first, wherein they balanced regardless of form of distribution.

TABLE 12

Illustrating the Effect of Skewness on the Third Moment

Class        Frequency $f$     Deviate        First moment         Third moment
center                                        $f(x - \bar{x})$     $f(x - \bar{x})^3$
 $x$         symm.   skew    $x - \bar{x}$    symm.   skew         symm.    skew
  0             0       1         -4             0      -4             0      -64
  1             2       2         -3            -6      -6           -54      -54
  2             9       8         -2           -18     -16           -72      -64
  3            23      22         -1           -23     -22           -23      -22
  4            32      32          0             0       0             0        0
  5            23      24         +1           +23     +24           +23      +24
  6             9       9         +2           +18     +18           +72      +72
  7             2       2         +3            +6      +6           +54      +54
Totals        100     100                        0       0             0      -54
Moment coefficients                              0       0             0    -0.54
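The figures of Table 12 can be reproduced with a short routine. This is a sketch under the table's own data; the function name `moment_coefficient` is an illustrative choice.

```python
def moment_coefficient(freqs, xs, order):
    """Mean moment of the given order about the mean, from a frequency table."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, xs)) / n
    return sum(f * (x - mean) ** order for f, x in zip(freqs, xs)) / n

classes = list(range(8))                       # class centers 0 to 7
symmetric = [0, 2, 9, 23, 32, 23, 9, 2]
skew = [1, 2, 8, 22, 32, 24, 9, 2]             # two individuals shifted

for name, freqs in (("symmetric", symmetric), ("skew", skew)):
    m1 = moment_coefficient(freqs, classes, 1)
    m3 = moment_coefficient(freqs, classes, 3)
    print(name, m1, m3)
```

The first-moment coefficient is zero in both series, whereas the third is zero only for the symmetrical one.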

KURTOSIS 

When the even moments are considered, it will be recognized that, 
although the effect of weighting larger deviates through using higher 
powers persists, the class moments now all have the same sign. Thus 
the even moments cannot reflect skewness; the result is the same as if 
the negative half of the distribution of deviates were folded over and 
added to the positive half. The effect of higher even powers is to 
accentuate the relative importance of the extreme deviations regardless 
of their signs in the first power. For this reason the standard deviation
is always greater than the average deviation.² If the fourth power of
the deviates is used, still greater emphasis will be given in the sum to
the presence of relatively long tails in a distribution, regardless of
skewness. Here, then, we are dealing with relative concentration of the
variates about the mean. Two distributions having the same standard
deviation will have different values of $m_4$ if one has longer tails than the
other, the higher $m_4$ going with the “longer-tailed” distribution. To
get such longer tails and keep the standard deviation the same, it is
necessary to move frequency from the medium deviation group (which
we may call the “shoulders” of the distribution), placing some individuals
nearer the mean and some farther away. Thus “peakedness” at the
center is increased along with tail extension when $m_4$ is increased without
altering the amount of variation. It was because of this that the term
kurtosis (implying sharpness of curvature at the center) was introduced
by Karl Pearson for this feature of relative clustering about the mean.
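A small numerical illustration may help. The two distributions below are hypothetical, constructed for this sketch so that both have the same second-moment coefficient; the second is obtained from the first by moving frequency off the “shoulders,” partly to the center and partly to the tails.

```python
def moment_coefficient(freqs, xs, order):
    """Mean moment of the given order about the mean, from a frequency table."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, xs)) / n
    return sum(f * (x - mean) ** order for f, x in zip(freqs, xs)) / n

# Hypothetical distributions of 8 individuals each.
shouldered_x, shouldered_f = [-1, 1], [4, 4]          # all frequency on the shoulders
long_tailed_x, long_tailed_f = [-2, 0, 2], [1, 6, 1]  # peak at center, longer tails

m2_a = moment_coefficient(shouldered_f, shouldered_x, 2)
m4_a = moment_coefficient(shouldered_f, shouldered_x, 4)
m2_b = moment_coefficient(long_tailed_f, long_tailed_x, 2)
m4_b = moment_coefficient(long_tailed_f, long_tailed_x, 4)
print(m2_a, m4_a)
print(m2_b, m4_b)
```

The standard deviations are equal, yet the fourth-moment coefficient of the longer-tailed series is four times that of the other.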

The use of powers of deviates has introduced a method of
quantitatively describing certain readily recognizable characteristics of frequency
distributions. Each mean moment secured is, however, a count of units
which are themselves the appropriate powers of the original unit of
measurement of the variable.³ Skewness and kurtosis, on the other
hand, are concepts that are basically independent of any scale of
measurement of a variable, and description of them should logically proceed
on a scale of pure numbers. Only in this manner may degrees of both
skewness and kurtosis be quantitatively described in terms which are
directly comparable from one distribution to another, regardless of the
scales of measurement of those variables.

² For reasonably normal distributions the standard deviation will be found to be
approximately 25 per cent greater than the average deviation. For the normal
curve itself the relationship is

S.D. = 1.2533 A.D.

Elimination of the units of measurement from moment coefficients 
may readily be secured by taking suitable ratios of them. Since the 
immediate problem is to transform the m3 and m4 values to pure number 
scales, and m2 is the lowest moment coefficient not having a fixed value, 
ratios involving m2 as the simplest reference unit would appear to offer 
possibilities. Karl Pearson suggested two such ratios which have proved 
very serviceable and are widely used. Employing Greek letters where
we use the English forms, he called them the “beta coefficients.” In
our symbolism they may be written

\[ b_1 = \frac{m_3^2}{m_2^3}, \quad \text{an index of skewness,} \tag{1} \]

and

\[ b_2 = \frac{m_4}{m_2^2}, \quad \text{an index of kurtosis.} \tag{2} \]

Both numerator and denominator of the ratio for b1 involve sixth powers
of the original unit of measurement, and therefore the coefficient is a
pure number. Likewise, b2 is a pure number derived from fourth-power
units. The utility of such empirical ratios should be judged in terms of 
how well in practice they serve the purpose of description for which they 
are offered. The answer arising from experience is that basically they 
serve very well indeed. Certain slight modifications, aiming to introduce 
direction of departure from normal curve values, are preferred by some. 
These latter indices are designated as the “gamma coefficients” or
“g statistics”:

\[ g_1 = \frac{m_3}{m_2^{3/2}}, \quad \text{an index of skewness,} \tag{3} \]

and

\[ g_2 = \frac{m_4}{m_2^2} - 3, \quad \text{an index of kurtosis.} \tag{4} \]


® The dimension “12 inches” when squared yields “144 square inches,” and so
on for higher powers. Extracting the square root of m2 to get the standard deviation
had, as its objective, a return to the original scale of measurement. 
Whereas the value of b1 is zero for symmetrical distributions (since
m3 = 0), it is positive for all other curves because m3 was squared to
secure it. g1, on the other hand, preserves the sign appropriate to the
direction of skewness. Obviously, squaring g1 gives b1.

Figure 13 presents three curves having different degrees of skewness 
but otherwise of like character. The b and g coefficients are given in
proximity to these curves, the middle curve being the symmetrical 
normal curve. 

FIGURE 13

γ1 as a Measure of Skewness

[Figure: three curves of differing skewness; abscissa: scale of relative deviates]


The fourth-moment coefficient, m4, is always positive, and therefore
b2 does not automatically provide a central reference value of zero.
Since the normal curve is a logical central type from which departure
may be measured with respect to kurtosis as well as skewness, and it has
the simple whole-number value of 3 for its b2 coefficient, kurtosis may be
measured directly from this origin:

\[ g_2 = b_2 - 3. \]

Thus g2 introduces the idea of negative and positive kurtosis with
respect to the normal curve as a standard of reference, positive and negative
values of g2 automatically specifying greater or less central clustering
or “peakedness” than the normal curve possesses. Figure 14 presents
three superimposed symmetrical bell-shaped curves differing in
kurtosis but otherwise similar. All curves have the same mean, standard
deviation, and area, the gamma coefficients changing directly with
the degree of central clustering.
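
The beta and gamma coefficients can be checked numerically. What follows is only a minimal Python sketch, assuming that each moment coefficient is the mean of the corresponding power of the deviations from the mean (divisor N). The function name `shape_coefficients` and the sample values are hypothetical illustrations, not data from the text.

```python
from math import sqrt

def shape_coefficients(values):
    """Return (b1, b2, g1, g2) from the second, third, and fourth moment coefficients.

    Assumption: each moment coefficient is the mean of the corresponding power
    of the deviations from the mean, with divisor N.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    m2 = sum(d ** 2 for d in deviations) / n
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n
    b1 = m3 ** 2 / m2 ** 3        # pure number: sixth-power units cancel
    b2 = m4 / m2 ** 2             # pure number: fourth-power units cancel
    g1 = m3 / (m2 * sqrt(m2))     # keeps the sign of m3
    g2 = b2 - 3                   # measured from the normal-curve value of 3
    return b1, b2, g1, g2

# Hypothetical right-skewed sample.
sample = [1, 1, 2, 2, 3, 9]
b1, b2, g1, g2 = shape_coefficients(sample)
print(b1, b2, g1, g2)
```

Multiplying every value of the sample by a constant (a change of unit, say from feet to inches) leaves all four coefficients unchanged, which is the sense in which they are pure numbers.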
FIGURE 14

γ2 as a Measure of Kurtosis



CURVE FITTING 

The subject of fitting frequency curves to observed distributions is a 
very interesting one from the theoretical point of view. It has rather 
small practical value, however, in statistical investigations which are not 
of a specialized nature demanding such work. Present comment on the 
subject will therefore be confined to indication of the mode of procedure 
in what is probably the most widely used method of fitting frequency 
curves, that given by the Pearsonian system. 

When it is desired to fit a frequency curve to an observed frequency 
distribution a decision must be made as to the particular respects in 
which the curve must fit. That is, a curve is fitted to a histogram by 
selecting an equation for the curve which has certain properties identical 
with corresponding properties of the histogram. The larger the number 
of these identical values, the better the fit will be in general, but a curve 
may be “overfitted” in the sense that the curve itself reproduces the
histogram too well, preserving rather than smoothing out its errors of 
sampling. Thus a compromise is always necessary; as few identities are 
made as possible to secure reproduction of the essential features without 
overemphasizing small details. The system of moments leads very nicely 
to the establishment of descriptions of distribution characteristics on a 
broad basis, and Pearson has amplified their usefulness by elaborating a 
scheme of curves covering a wide array of combinations of these moments 
to the fourth order. 
Starting with a differential equation which yields the normal curve as
one of its solutions, Pearson derived a series of frequency functions which
appeared to be adaptable to nearly all observed types of frequency distribution.
The two beta coefficients already defined, together with
mean, standard deviation, and total frequency, are all the values needed 
from an observed distribution in order to select an appropriate curve 
from Pearson’s series for graduation purposes. The technical discus- 
sion of procedure in fitting is quite beyond the scope of this book. The 
interested reader may be referred for such detail to Elderton^ and to 
Pearson’s summary of procedure ^ with most useful graphs for selecting 
the appropriate type of curve. 

^ W. Palin Elderton. Frequency curves and correlation. Cambridge University 
Press. 3d edition, 1938. 

® Karl Pearson. Tables for statisticians and biometricians, Part II. Cambridge 
University Press. 1931. 
THE NORMAL CURVE


At the close of the second chapter the statement was made that 
biological variables for the most part do not deviate very markedly from 
the “normal curve” or “law of error” form. Both Quetelet and Galton, 
working with anthropometric variables, were deeply impressed by the 
way this curve seemed to graduate their data. The prominence they 
gave it probably contributed more than anything else to an early notion 
that this curve (which Galton often referred to as the “normal law”) 
was the normal type for biological variables, whereas we now recognize
it only as a central type in a general scheme of biological frequency 
distributions. The designation of “normal curve” was accepted by 
Pearson as a convenient way of avoiding the issue of choosing between 
the originator titles “Gaussian curve,” “Laplacian curve,” and “Gauss- 
Laplace curve,” to which must now be added “DeMoivrian curve.” 
In adhering to the term normal curve herein, one wishes to warn the 
reader against any implication that other types of curves are “abnormal” 
in the usual sense of the word. 

While holding considerable interest as a graduating function that 
works remarkably well for many biological variables, the normal curve 
is far more important for its portrayal of the way in which means and 
other statistics vary through errors of random sampling. It commands 
so much interest that a somewhat more detailed analysis of its nature 
than has so far been presented seems appropriate at this point. 

Mathematical equations defining the normal curve may all be
reduced to the simple form

\[ w = \frac{C}{e^{k^2/2}} \tag{1} \]

where w = the ordinate at any point,
C = a constant of the curve,
e = the natural logarithm base, 2.7183..., and
k = the relative deviate of x from the mean of the curve.


Even this equation may look complex enough to the biologist on first 
inspection, but if he is willing to refresh himself a little on exponents he 
may analyze the form of the curve without difficulty. 
First of all, let us consider the transformation from the variate scale,
x, to the relative deviate scale, k. The largeness or smallness of any value
x may be expressed on the universal scale of pure numbers by securing
its relative deviate value. This is simply the number of standard deviations
that x is away from the mean of the series, either in the positive or
negative direction:

\[ k = \frac{x - \bar{x}}{s}. \]

Both x − x̄ and s are dimensions on the scale of measurement of x.
If we divide one by the other the original units of measurements (inches, 
pounds, et cetera) vanish, and k as defined above becomes a pure num- 
ber. This is a generalizing transformation of scale of great importance 
in statistics. It enables one to compare the largeness (or smallness) of a 
variate in its distribution with the same property for any other variate, 
whatever the variable may be. The transformation will be of great 
value to our development of the notion of correlation in the following 
chapter. At the present moment, however, it serves to place a standard 
scale under all normal distributions, permitting our study of the proper- 
ties of the curve as such. Once the mean and standard deviation of a 
variable have been secured, the universal scale of relative deviates may 
be placed in juxtaposition with its distribution. This k scale will have
its 0 value against x̄, the +1 and −1 values respectively at the deviations
of plus and minus one standard deviation from x̄, the scale
extending without limit by such steps in both directions.
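
The transformation to relative deviates may be illustrated with a short Python sketch. It assumes the standard deviation is computed with divisor N (`statistics.pstdev`); the function name `relative_deviates` and the statures are hypothetical values chosen only for illustration.

```python
from statistics import mean, pstdev

def relative_deviates(values):
    """Convert each value to k = (x - mean) / s, a pure number."""
    centre = mean(values)
    s = pstdev(values)  # assumption: standard deviation computed with divisor N
    return [(x - centre) / s for x in values]

# Hypothetical statures, first in inches and then in centimetres.
inches = [64.0, 66.5, 68.0, 69.5, 72.0]
centimetres = [2.54 * x for x in inches]

k_inches = relative_deviates(inches)
k_centimetres = relative_deviates(centimetres)
print(k_inches)
print(k_centimetres)  # same values apart from rounding error
```

Whatever the unit of measurement, the same k scale results, with its zero against the mean.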

The natural base of logarithms, e, is of no consequence to us at the 
moment beyond recognition of the fact that it is a pure number greater 
than unity. Some knowledge is necessary, however, of the numbers
which result when such a number is raised to powers. Since the zeroth 
power of any number is 1, and the first power of e is 2.7183, then any 
power of e between the zeroth and the first yields a number between 1 
and 2.7183. Powers of e above the first yield numbers greater than 
2.7183, of course, and all positive powers of e will therefore yield numbers
greater than 1. We are now in a position to analyze the form of
the normal curve as it is defined in equation (1). 

ORDINATES OF THE CURVE 

The ordinate of the normal curve will reach its greatest value C when
$e^{k^2/2}$ reaches its minimum value of unity. This minimum value occurs
only when k is zero, corresponding to the mean value on the scale of any
distribution of reference. Since k is squared in equation (1) the ordinate
at any chosen negative value of k must equal the ordinate at the same
positive value of k and be less than the value C at the center. Therefore
the distribution curve is of the symmetrical bell form.

Those familiar with logarithms may readily calculate a series of 
ordinates and draw the curve. Arbitrarily making the mid-ordinate 
100, one finds the ordinates at 1, 2, and 3 standard deviations each way 
to be 60.7, 13.5, and 1.1 units approximately. Larger values of k than
3 will give very small values of w, which, nevertheless, will be real; so the
curve has a theoretically infinite range in both directions from the mean. 
This does not invalidate the curve as applied to biological data, for it 
may readily be seen that the probability of extreme deviates in normal 
distributions approaches infinitesimal magnitudes quite in accord with 
statistical experience of biological variables. 
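
The ordinates quoted above may be verified without logarithm tables. This is a minimal Python sketch of the calculation, taking the mid-ordinate C as 100 as in the text; the function name `ordinate` is a hypothetical label.

```python
from math import exp

def ordinate(k, c=100.0):
    """Ordinate w = C / e^(k^2 / 2) at relative deviate k."""
    return c / exp(k * k / 2)

for k in (0, 1, 2, 3):
    print(k, round(ordinate(k), 1))
```

The rounded results at k = 1, 2, and 3 agree with the 60.7, 13.5, and 1.1 units given above, and the ordinate at −k equals that at +k because k enters only as its square.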

A plot of the normal frequency function scaled in terms of relative
deviates is given in panel (A) of Fig. 15. Ordinates have been drawn 
at distances of 1, 2, and 3 standard deviations on either side of the mean. 
It will be noted immediately that the curve appears to merge into the
base line at, or a trifle beyond, 3 standard deviations. At the points k
equals +3 and −3, the height of the ordinate is only one-ninetieth part
of that at the center, and the area in each tail so truncated is only a little 
over one-thousandth part of the whole area. For practical purposes the 
tangible range of the normal curve is accordingly often referred to as 
being roughly three standard deviations on both sides of the mean. As was 
noted above, the range is theoretically infinite, but the parts beyond ±3 
standard deviations are barely perceptible. 

PROBABILITIES OF THE CURVE 

In Chapter 2, emphasis was given to the point that, for continuous 
variables, frequency is properly portrayed by area. The frequency 
curve being the appropriate form of graduation for such variables, it 
follows at once that the area between any two ordinates, when expressed 
as a proportion of the entire area of the curve, defines the probability of 
occurrence of values between the two points on the scale at which those 
ordinates are drawn. This is a very important concept in statistics — 
indeed, the one upon which the final statistical verdict in many analyses 
rests. The normal curve plays so crucial a part in determining such 
verdicts that it is important for the student to acquaint himself some- 
what with the probability zones of the curve. It is partly for this reason 
that the ordinates establishing such zones have been drawn in panel (A) 
of Fig. 15. By means of a planimeter or other suitable device the areas 
between the ordinates in the figure could be determined. If these were 
then expressed as proportions of the total area of the curve, it would be 
found, for instance, that a little more than two-thirds of the total area
lies between the ordinates erected at ±1 standard deviation, and approximately
95 per cent lies between ±2 standard deviations.

FIGURE 15 

Relative Areas Between Symmetrically Placed Ordinates
for the Normal Curve




These proportionate areas may be evaluated more exactly, indeed to 
any required degree of precision, by means of the integral calculus, and 
we shall rely now upon determinations so made. For the central zones 
indicated the values are given to four places of decimals. Note that 
99.73 (or more precisely 99.73002...) per cent of the area lies within ±3
standard deviations; that is, the probability of a normally distributed 
variable having deviates exceeding ±3 standard deviations is slightly 
less than 27 in 10,000. 
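
In place of a planimeter or the integral calculus, the central areas may be obtained from the error function of Python's standard `math` module. The sketch below relies on the identity that the area within ±k standard deviations equals erf(k/√2); the function name `central_area` is hypothetical.

```python
from math import erf, sqrt

def central_area(k):
    """Proportion of the normal curve's area lying within +/- k standard deviations."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(central_area(k), 4))
```

The three values, 0.6827, 0.9545, and 0.9973, correspond to the "little more than two-thirds," the "approximately 95 per cent," and the 99.73 per cent mentioned in the text.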



The ordinates embracing the central 95 per cent of cases, or trun- 
cating the extreme 2.5 per cent of the total area in each direction, have 
come to play an important part in statistical interpretation. These 
ordinates arise at the k values of ±1.9600..., or what is roughly called
two standard deviations on either side of the mean. That is, in a nor- 
mally distributed variable, there is only 1 chance in 20 that values will 
deviate from the mean by more than 1.96 standard deviations; the 
chance is actually 1 in 22 that values will deviate from the mean by more 
than two standard deviations. 

The question may be raised: At what distance from the mean would 
the two ordinates embracing the central 50 per cent of the area be 
placed? Such ordinates will define the zone within and outside of which 
it is equally probable that individuals will occur. Those ordinates arise
at the points k equals ±0.6744898.... The distance of these ordinates
from the mean was extensively used by astronomers in describing the
amount of error made in measuring the stellar universe. Later adopted
by the biologists as a measure of variation, the quantity ±0.6745
standard deviations has become commonly known as the probable error. The reader may
note that the probable-error points correspond to the first and third 
quartile values. 

Panel {B) of Fig. 15 shows a normal curve divided into zones by 
ordinates raised at ±1, ±2, and ±3 probable errors. Since the factor 
0.6745 differs only very slightly from two-thirds, the “5 per cent points” 
are approximated roughly by three probable error units, just as they are 
by two standard deviation units. These approximations are very useful 
to remember as simple guiding values in judging the importance of a 
deviation. 
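
The ordinates embracing a stated central proportion of the area can be found by inverting the normal probability integral. The following sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `central_limit` is a hypothetical label.

```python
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, standard deviation 1

def central_limit(proportion):
    """Relative deviate k such that +/- k embraces the given central proportion."""
    return unit_normal.inv_cdf(0.5 + proportion / 2)

k_50 = central_limit(0.50)   # probable-error point
k_95 = central_limit(0.95)   # "5 per cent point"
print(round(k_50, 4), round(k_95, 4))
print(round(3 * k_50, 4))    # three probable errors, in standard deviation units
```

Three probable errors come to a little more than two standard deviations, so either rule of thumb lies close to the exact 5 per cent point of 1.96.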


TABLES OF THE NORMAL CURVE 

One’s readily available information on probabilities in general for 
the normal curve needs to be much more detailed than these few skele- 
ton values. A very frequently recurring problem in statistical analysis, 
especially when the variation is attributed to “errors of random sampling,”
is to ascertain with reasonable precision the probability that a
specified value in a normal distribution will be exceeded. Occasion- 
ally, also, the normal curve ordinates are desired for plotting the curve. 
Accordingly, tables of these two functions are very common indeed. 
Such tables need to be in a generalized form so that they may be adapted 
to any normal distribution with its specified values of the mean and 
standard deviation. The most convenient arrangement is to give the 
ordinates and fractional areas for the curve in terms of k, the relative 
deviate, as a scale of measurement, assigning a total area of one unit to 
the curve. Just as the use of relative deviates fixes the base scale, so
the requirement that the area under the curve shall be unity fixes the 
vertical scale. Very full tables of ordinates and areas for the normal 
curve are so given in Part I of Pearson’s “Tables for Statisticians and
Biometricians,” and those concerned with fitting the normal curve or
securing probabilities with considerable precision are referred to that 
source. 

In determining the probability that an event may be expected to 
occur through “errors of random sampling,” one is often quite satisfied 
with three or four places of decimals in that probability. Appendix I 
to this book gives for such purposes a table of the needed relative areas 
to four places of decimals, by increments of one-hundredth in k. The
ordinates are also given in juxtaposition to three places of decimals. The
area or probability, P, given in Appendix I, is that in the tail of the
curve beyond the point k, expressed as a proportion of the area of half
the curve. This is the same probability as that of both tails combined in 
relation to the whole curve. Discussion of the reason for choosing this 
particular probability function is given in the Appendix. 
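
The two tabulated functions just described can be computed directly. The sketch below is not a reproduction of Appendix I; it only evaluates, for a few values of k, the probability P as defined above and the ordinate of the curve of unit area. The function names are hypothetical.

```python
from math import erfc, exp, pi, sqrt

def tail_probability(k):
    """P: area beyond k in one tail, as a proportion of half the curve.

    This equals the area of both tails combined as a proportion of the whole curve.
    """
    return erfc(abs(k) / sqrt(2))

def unit_area_ordinate(k):
    """Ordinate at k when the total area under the curve is one unit."""
    return exp(-k * k / 2) / sqrt(2 * pi)

for k in (0.00, 1.00, 1.96, 3.00):
    print(f"{k:.2f}  P={tail_probability(k):.4f}  ordinate={unit_area_ordinate(k):.3f}")
```

At k = 0 the probability P is 1, since the whole of each half-curve then lies beyond the point; at k = 1.96 it falls to 0.05.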

SOME APPLICATIONS OF THE NORMAL CURVE TO 
OBSERVED DISTRIBUTIONS 

Although it is interesting to fit the appropriate curve to any sufficiently
large series of variates, it is not usually of any particular value to
do so in practical analysis of data. Exercises in fitting a normal curve, 
therefore, become of academic interest rather than scientific importance. 
The situation is very different, of course, when one needs to consider the 
probabilities of occurrences obeying the normal law, the answers to these 
propositions being derived from the tables of the normal curve areas.
Those engaged in precise measurement may well follow with very keen 
interest the probability of the occurrence of a chance error of any given 
magnitude. If the errors are unbiased and normally distributed, 
knowledge of the mean and standard deviation of a sufficiently large 
series of repetitions of the measurement will give the key at once to those 
probabilities. In this connection the reader may choose to apply him- 
self to such questions with respect to the width of the spectral-line data 
given in Fig. 1.^ The mean and standard deviation of width of line for
the measurements are 178.9 and 3.583 microns, respectively. 
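
Questions of this kind may be worked as in the following Python sketch. The mean and standard deviation are those quoted above; the 5-micron error and the function names are hypothetical, and the errors are assumed to be unbiased and normally distributed.

```python
from math import erfc, sqrt

MEAN_WIDTH = 178.9   # microns, from the text
SD_WIDTH = 3.583     # microns, from the text

def chance_of_error_exceeding(magnitude):
    """Probability that a measurement deviates from the mean by more than
    `magnitude` microns in either direction, assuming normal, unbiased errors."""
    k = magnitude / SD_WIDTH
    return erfc(k / sqrt(2))

def chance_of_width_above(width):
    """Probability that a single measurement exceeds `width` microns (one tail)."""
    k = (width - MEAN_WIDTH) / SD_WIDTH
    return 0.5 * erfc(k / sqrt(2))

# Hypothetical question: how often would an error exceed 5 microns?
print(chance_of_error_exceeding(5.0))
print(chance_of_width_above(185.0))
```

Each question reduces to finding the relative deviate k and then reading the corresponding tail area of the normal curve.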

A problem of not infrequent occurrence in educational work is that of 
converting scores on aptitude or intelligence tests to their equivalent 
values on some more or less standard score scale, or to their equivalent 

^ Vide page 13. 
values on the scale of any other specified test of like purpose. We may
illustrate the underlying reasoning appropriate to such transfers in 
terms of application of the normal curve. The distribution of scores on 

FIGURE 16

Determination of Equivalent Scores in Two Examinations by
Equating the Relative Deviates

[Figure: panels (A), (B), and (C); the surviving axis label reads “Score on Part B of test”]


Part B (current affairs) of a “reflective thinking” examination given to
449 students has been used in Fig. 26^ as an illustration of the commonly
occurring normal distribution of measurements of specific intellectual
achievements. They are reproduced in panel (B) of Fig. 16, wherein
the histogram follows the original marks without grouping. The distribution
of scores for the same students on Part A (science) of the
examination is given in like form as panel (A) of Fig. 16. In both, a
normal curve graduates the data satisfactorily. The actual scores on
Part A varied from 1 to 39; those on Part B covered the much wider
range from 47 to 108. For each score on Part A, what is the comparable
score on Part B as far as performance of the group as a whole is concerned?

^ Vide page 14.

Equating the possible ranges of marks, 0 to 49, and 0 to 123, respec- 
tively, would not answer this question, for that would assume the two 
examinations to be of equal difficulty. Equating the observed ranges of 
scores would also fail to answer the question, for it would ignore all marks 
but the extreme ones, an obvious blunder. One may solve the problem
appropriately only by equating those scores which are of equivalent
probability of occurrence as indicated by the graduating curves. Since 
both curves are of the normal type, this is a simple problem; one needs 
only to find the scores for each test which have the same relative deviate 
values. The two distributions have been drawn in Fig. 16 with their
means and standard deviations in correspondence. This gives to the 
corresponding points on the two scales equivalent probability according 
to the graduating curves. 

Panels (A) and (B) of Fig. 16 have been prepared solely to demon- 
strate the underlying principle. They are by no means necessary for 
solution of the problem. A simple graph with one score scale along the 
axis of abscissas and the other along the axis of ordinates might have 
been prepared. A straight line drawn across the surface and passing 
through the points (x̄, ȳ) and (x̄ + s_x, ȳ + s_y) would establish the equivalent
scores immediately. Such a graph forms panel (C) of Fig. 16,
from which the equivalent scores may be read off with ease. 
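
The straight line of equivalent scores can also be expressed arithmetically. In the sketch below the means and standard deviations are hypothetical, since the text does not state them for either part of the examination; `equivalent_score` is likewise a hypothetical name.

```python
def equivalent_score(score_a, mean_a, sd_a, mean_b, sd_b):
    """Score on test B having the same relative deviate as score_a on test A."""
    k = (score_a - mean_a) / sd_a
    return mean_b + k * sd_b

# Hypothetical means and standard deviations; the text does not state them.
MEAN_A, SD_A = 20.0, 6.0
MEAN_B, SD_B = 78.0, 10.0

for score in (8, 20, 26, 32):
    print(score, equivalent_score(score, MEAN_A, SD_A, MEAN_B, SD_B))
```

The mean of one test maps to the mean of the other, and a score one standard deviation above the first mean maps to a score one standard deviation above the second, which are the two points through which the line is drawn.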

It may be noted that the problem of finding scores of equivalent 
probability of occurrence is simplified in the above illustration because
both sets of scores are normally distributed. The same procedure might 
also be followed when both distributions have the same basic form as 
evidenced by their “gamma” coefficients. However, distributions of
essentially different form would necessitate rather involved calculation 
to find the equivalent probability scores. 

Consideration of the use of the normal curve in the interpretation 
of the significance of differences between means and other descriptive 
statistics must be deferred until discussion of the “errors of random 
sampling” is undertaken in a later chapter. Since acquaintance with 
the correlation coefficient will prove most important to adequate devel- 
opment of the principles involved in such tests, attention will first be 
directed to the description of associated variation. 
BIVARIATE DISTRIBUTION AND THE COEFFICIENT OF
CORRELATION 

Study of the variation in characters considered singly leads naturally 
to inquiries concerning the tendency of related characters to show associ- 
ated variation. Baking quality of flour is a variable characteristic of that 
product of the milling of wheat. Likewise the chemical composition of 
wheat varies from sample to sample. Is the former related in any man- 
ner to the latter? To what extent does the yield of sugar from a field of 
beets depend upon the rainfall or the quantity of fertilizer employed? Is 
the intelligence of the child more highly related to that of the mother 
than to that of the father? Does environment influence the error of 
personal judgment? A veritable host of such questions must have sug- 
gested themselves at one time or another to the imaginative student of 
biology. 

Francis Galton, however, was the first to give a practical solution to 
the problem in the form of a readily determinable and widely comparable 
numerical measure of the intensity of association between variations. He 
introduced to quantitative biology a new concept that has been most 
useful to a large body of statistical procedure. It has been revolutionary 
in its effect of displacing the idea of complete causation by one of incomplete
association, the intensity of which may be measured with precision.
It is a great step forward in science to proceed from the measurement
of things to quantitative determination of the degree of interdependence
between things. Galton’s contribution to science in this
regard has had consequences reaching far beyond his specific problem of 
human inheritance. Among statisticians the importance of the concept 
itself is too often lost sight of in disputations concerning the merits of
various statistical solutions. 

The problem of measuring associated variation is again one of de- 
scribing the character of a frequency distribution. This time, however, 
the distribution must be extended in its spatial formation, since one 
dimension will be required for each variable considered. For the inter- 
relation of two variables, the frequency distribution may be designated as 
bivariate by way of distinction from the univariate systems so far con- 
sidered. 

PORTRAYAL OF THE CORRELATION SURFACE

The graphical representation of a bivariate frequency distribution 
does not present any difficulty. Two dimensions of space are required 
for the scales of measurement of the two variables, and these may be 
chosen conveniently as the axes of ordinates and abscissas of any cross- 
section paper. Each point on the surface so scaled will correspond to a 
pair of magnitudes of the variables. The task remains to find a suitable 
form of representation for the frequency of occurrence of these paired 
values. Undoubtedly, understanding of the correlation surface would be 
considerably aided if the third dimension perpendicular to the surface 
could readily be used for this purpose, but the preparation of such a solid 
is sufficiently difficult to forbid the general practice. Quite obviously, 
it is necessary in most cases to confine the representation to the two dimensions
of a surface.

Two methods for the representation of frequency under such circumstances
without demanding another dimension of space are quite familiar
to all. First, a dot may be used for each pair of associated variates. The
optical effect of the clustering of the dots in such a scatter diagram sug- 
gests density of population at once, and imparts to the device a decided 
superiority for graphical purposes. Second, the frequency of individuals 
occurring within prescribed class ranges may be given in numerical form, 
thus providing a correlation table. This is a decided aid in computational
work when the number of paired variates is large. 

One may consider, by way of illustration, the distribution of frequency
in the assortative mating of man for stature. Do tall men seek
the companionship of tall women, or is there a law of opposites that
makes “brunettes and blondes rush irresistibly together”? The existence
of such a law is not merely an assumption of Percy Public; it was
taught not many years ago in the social sciences. That which is unusual
claims interest, and the commonplace most frequently gives way to the
unusual in an observer's eyes. Truth is often lost in the unrecognized
pursuit of exceptions. 

An unbiased answer to our question on assortative mating will be 
provided by studying the association in a random sample of the general
population. Such a sample is presented in Fig. 17.^ Stature of the wife 
is represented on the scale of ordinates; stature of the husband is given 
by the abscissal scale. Since stature is truly continuous and proceeds 
by intangible increments, the entire surface represents possible combinations
of magnitudes of the two variables.

^ The scatter presented in Fig. 17 is derived from data published in another form
by Karl Pearson and Alice Lee in a study entitled “On the laws of inheritance in
man. I. Inheritance of physical characters.” Vide Biometrika, 2: 357-462. 1903.
FIGURE 17

Scatter Diagram of Assortative Mating for Stature in 1079 Husbands
and Wives^



In the absence of assortative mating for stature, the dots representing 
actual matings would not tend to cluster about any line of trend. Neither 
should they be expected to be equally distributed over the entire surface, 
for all statures do not occur with equal frequency. Stature in each sex is 
distributed very nearly in accordance with the normal curve. Hence, in
the bivariate surface for assortative mating the clustering would be 
expected, in the absence of likeness in stature of the matings, to occur
about a central point given by the intersection of the two means, and to 
fall away symmetrically about this point. From a knowledge of the 
univariate distributions, one may readily construct in advance the bivariate
surface for complete absence of association between the two
variables. It must be quite clear from Fig. 17 that, far from there being 
a “law of opposites” governing human matings, there is rather a tendency
for marriage to take place between individuals of similar stature. 
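
The construction of the bivariate surface for complete absence of association may be sketched as follows. The rule assumed here is the usual one for independent variables: each cell frequency is the product of its two marginal totals divided by N. The marginal frequencies and the function name `expected_frequencies` are hypothetical, not the Pearson and Lee data.

```python
def expected_frequencies(row_totals, column_totals):
    """Cell frequencies expected when the two variables are unassociated.

    Each cell is (row total x column total) / N, so every row has the same
    proportionate distribution as the column margin.
    """
    n = sum(row_totals)
    if n != sum(column_totals):
        raise ValueError("marginal totals must sum to the same N")
    return [[r * c / n for c in column_totals] for r in row_totals]

# Hypothetical marginal frequencies (wives by rows, husbands by columns).
wives = [10, 30, 10]
husbands = [5, 20, 20, 5]
table = expected_frequencies(wives, husbands)
for row in table:
    print(row)
```

The greatest expected frequencies fall where the modal classes of the two margins intersect, and they fall away symmetrically from that point, with no line of trend.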

The correlation table formed by grouping the frequencies within the 
class boundaries indicated in Fig. 17 is presented as Fig. 18. The superi- 
ority of the scatter diagram in portraying the association will be evident 
after comparison of these two presentations. But whereas the scatter 
diagram is most useful optically, the table is necessary for computational 
work. 
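
Grouping the paired variates into a correlation table is a simple counting operation. The following is a minimal sketch, assuming classes of equal width for both variables; the statures, class limits, and function name are hypothetical and are not taken from Fig. 17 or Fig. 18.

```python
from collections import Counter

def correlation_table(pairs, x_start, y_start, width):
    """Count paired variates falling in each cell of equal-width classes.

    Returns a Counter keyed by (x_class_index, y_class_index).
    """
    cells = Counter()
    for x, y in pairs:
        x_class = int((x - x_start) // width)
        y_class = int((y - y_start) // width)
        cells[(x_class, y_class)] += 1
    return cells

# Hypothetical (husband, wife) statures in inches.
pairs = [(66.2, 61.5), (67.9, 62.3), (70.1, 64.8), (71.4, 65.0), (66.8, 61.1)]
table = correlation_table(pairs, x_start=66.0, y_start=61.0, width=2.0)
for cell, frequency in sorted(table.items()):
    print(cell, frequency)
```

Each key identifies one cell of the table by its class index on the two scales, and the count attached to it is the frequency entered in that cell.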


FIGURE 18 

Correlation Table for Assortative Mating for Stature
(Data of Figure 17)

[Table not reproduced; surviving column heading: “Stature of husband in inches”]


The representation of this bivariate surface may now conveniently be 
carried a step further to the three-dimension figure. This may be
achieved by erecting above the cells of the table in Fig. 18 a series of
rectangular prisms, each of a height proportionate to the frequency of the 
cell which forms its base. This is correct procedure since in the seriation 
all classes have the same range for each variable. Frequency now has the 
dimensions of volume; it is spread over an area, and the volumes will be 
proportional to height only if the cross section remains constant. 
A diagram of a frequency solid so prepared for the assortative mating
data is given as Fig. 19. The correspondence between this solid and the 
histograms of the univariate systems is immediately apparent. One may 
refer to Fig. 19 as that of a solid histogram. Just as the univariate 
histograms may be smoothed by curves, so the solid histograms may be 
graduated by smooth curving surfaces more appropriately representing 
the infinitely large populations from which samples are drawn. Such a 
graduation has been made for Fig. 19, and on the smooth curved surface 

FIGURE 19 

Perspective View of the Frequency Solid for Assortative Mating
in Man for Stature

(Data of Figure 18) 



contour lines, or isograms, have been drawn joining the points of equal 
height above the base. The solid in perspective and its top elevation 
showing only the contours are reproduced in Figs. 20 and 21, 
respectively. 

The remarkable and beautiful simplicity of the representation of a 
bivariate distribution of frequency given in Fig. 21 must surely appeal to 
all. It is precisely analogous to that first uncovered by Francis Galton in 
his study of inheritance of stature in man, which provided the foundation 
upon which his brilliant pioneering mind built an enduring edifice. Just 
as the scatter diagram portrays any actual bivariate frequency distribution 
satisfactorily in two dimensions, so the contour diagram may trace 
the same data in graduated and more readily assimilable form. Different 
intensities of association between paired variables will show up with 
remarkable clarity in terms of such contour diagrams. Complete sys- 
tems characterized by common elements spring immediately into view, 
and it was upon one such system that Francis Galton focused his analyt- 
ical power. It was the system defined by the normal surface in its transi- 
tion from no correlation at all to perfect association of the two variables. 


FIGURE 20 

Perspective View of the Normal Frequency Solid Graduating Figure 19 



In drawing the foregoing illustrations we have assumed the assorta- 
tive mating data to belong to this system, and to the accuracy of that 
assumption reasonable challenge is not likely to be given. The normal 
bivariate frequency distribution is characterized by two readily recogniz- 
able and simple features: 

(1) normal distribution of both variables considered singly, and 

(2) straight-line average dependence of one variable on the other. 

The univariate distributions for stature in husband and wife, which may 
readily be plotted from the marginal columns of Fig. 18, are graduated 
very well by normal curves. Questions of the average interdependence 
of these two variables raise a new issue. On the average, how does the 


FIGURE 21 

Contour Diagram of the Normal Surface Fitted to the Data of Figure 17 

[Figure not reproduced. Horizontal axis: Stature of Husband (in inches), 60 to 74.] 


stature of the wife change with that of the husband? The small circles 
in Fig. 17 locate for the observed data the average stature of wife within 
each successive inch range of stature of husband. The trend is obviously 
of a straight-line nature. A corresponding plot for change in husband's 
stature with that of the wife, given by the small crosses, shows the same 
straight-line trend. Determination of these graduating straight lines will 
be reserved for the next chapter; meantime we are concerned solely with 
the demonstration of rectilinear average dependence of each variable on 
the other given by the located bands of averages. 

Just as the ellipse is to be regarded as the transitional figure between a 
circle and a straight line, so the elliptical contours of Fig. 21 may be 
regarded as representing an intermediate stage between total absence of 
association (given by a system of circular contours) and perfect depend- 
ence between two variables given by a straight-line relationship. Surely 
this degree of transition may be expressed by a pure number on a simple 
scale! So the concept of a coefficient of correlation was nurtured in Galton's 
mind. A measure of partial association! The prospect this opened 
to his searching vision is aptly portrayed in his words: 

This part of the inquiry may be said to run along a road on a 
high level, that affords wide views in unexpected directions, and from 
which easy descents may be made to totally different goals. ... I 
have a great subject to write upon, but feel keenly my literary inca- 
pacity to make it easily intelligible without sacrificing accuracy and 
thoroughness. 

The correlation scale selected by Galton covered the limited range 
from zero to unity, the coefficient bearing a positive or negative sign as 
the one character varied directly or inversely with its correlated char- 
acter. This scale of elemental simplicity has persisted without challenge. 
A coefficient of zero means that there is no interrelationship at all between 
the two variables. Knowledge of the magnitude of one does not lead to 
any knowledge whatsoever of the largeness or smallness of the other. On 
the other hand, a correlation coefficient of unity means that when the 
measure of one character is known the other is also fully defined. Inter- 
mediate values designate degrees of intensity of association within these 
natural limits. 

DERIVATION OF A MEASURE OF ASSOCIATION 

Let x and y designate two associated variables for which it is desired 
to measure the intensity of correlation. They may represent two mea- 
sures of different characters on the one individual, or measures of the 
same or different characters on different individuals, but they should 
certainly arise for consideration as pairs for some logical reason. If the 
number of such paired observations is N, then the series may be de- 
scribed algebraically as 

$$x_1,\ x_2,\ x_3,\ \ldots,\ x_N,$$

$$y_1,\ y_2,\ y_3,\ \ldots,\ y_N.$$

For purposes of giving explicit form to our general concept of correlation, 
let it be assumed by way of example that y has a perfect positive 
correlation with x. By this we mean that a large value of x has associated 
with it a correspondingly large value of y, and vice versa. The notion need- 
ing explicit definition is that of corresponding largeness or smallness. The 
magnitude of each individual measurement is to be evaluated for its 
relative largeness or smallness in comparison with the whole set of mea- 
sures to which it belongs. All the necessary descriptions to enable this 
are at hand. The relative deviate measures precisely the relative large- 
ness of the individual in its own group. 

The scale of relative deviates has already received some consideration. 
We have observed its pure number character, giving universal compar- 
ability to its units. Also, the mean and standard deviation of the relative 
deviates comprising any group of variates are always zero and unity, 
respectively. Correlation analysis brings us now to paired values of rela- 
tive deviates, each member of the pair coming from its own group or 
variable. 

Transferring symbolism to the k notation used previously, wherein 
$$k_x = \frac{x - \bar{x}}{s_x}, \quad \text{and} \quad k_y = \frac{y - \bar{y}}{s_y},$$

it will be recognized immediately that any system of N paired values in 
which kx equals ky within every pair is a system of perfect correlation. 
Whatever the magnitude of one variable may be, the associated value of 
the other has precisely the same relative magnitude. If the signs are 
alike within the pairs, the correlation is spoken of as positive; if they are 
opposed, it is designated as negative. The problem of measuring correlation 
in general, as one may appreciate it at this point, may be considered 
as a problem of measuring the “degree of likeness” in magnitude and sign 
of the members of the N pairs of relative deviates which may be computed 
from the N pairs of variates. From N pairs of relative deviates a single 
value is to be derived having the following properties. If the agreement 
between these two values is perfect in every pair, then it is desired to 
describe the association by a numerical value of unity. If there is no 
association whatever, then a value of zero on the scale must be awarded. 
Intermediate degrees of association must be portrayed by values between 
these two limits. If the paired relative deviates tend to have the same 
sign, the association may be designated as positive; if unlike, then 
negative. 

It is an obvious prerequisite that, if a single numerical quantity de- 
scriptive of the correlation between a set of N paired magnitudes is to be 
derived from those N pairs, then the pairs must first be fused by some 
appropriate method before an average is secured. The simplest method 
of fusion which proves fruitful is that of multiplying the paired relative 
deviates together. Adding these products and dividing by N to secure an 
average, we have 


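$$\frac{\sum k_x k_y}{N} = \frac{\sum \left[\dfrac{x - \bar{x}}{s_x}\right]\left[\dfrac{y - \bar{y}}{s_y}\right]}{N} = \frac{1}{s_x s_y}\left[\frac{\sum (x - \bar{x})(y - \bar{y})}{N}\right]. \tag{1}$$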

The standard deviations, being constants, may be placed outside the sum- 
mation sign, leaving the mean product of the deviates as the primary 
variable in the function. In analyzing the descriptive value of the mean 
product of the relative deviates, one may advantageously study first the 
mean product of the deviates, then later divide by the term SxSy. 

The change in nature of the mean product of the deviates may be 
scrutinized in a very simple geometric manner. Frequency surfaces of 
different degrees of correlation may be represented by means of elliptical 
contours, one only being sufficient for each degree of correlation consid- 
ered. In Fig. 22 increasing degrees of correlation are portrayed in the 
diagrams arranged in order from (A) to (F), the former designating zero 
correlation, and the latter, perfect correlation, between the two variables. 
Lines corresponding to the mean values, x̄ and ȳ, divide each of these 
surfaces into two sections if each mean is considered alone, or into quadrants 
when both are considered. All values of x to the right of the vertical 
mid-line will have a positive sign for x − x̄; those to the left will have a 
negative sign for the deviation. Similarly, all values of y above the 
horizontal mid-line will be positive deviates; those below, negative 
deviates. 

If each quadrant is now considered individually, it will be clear that 
the product of all the paired deviates within it must have the same sign. 
As one proceeds from the upper right quadrant in a clockwise direction, 
the sign of the product in successive quadrants will be plus, minus, plus, 
and minus. Thus, opposite quadrants have the same sign for their products 
of deviates. The total product of the deviates for each of the successive 
surfaces in Fig. 22 may be judged for its nature in a relative way by 
visual summation in terms of signs and areas of these four parts in each 
surface. 

The symmetry of surface (A), portraying complete absence of correla- 
tion between the two variables, clearly makes the total product zero in 
that case, and therefore equation (1) takes the value zero. However, as 
the degree of correlation increases, so the proportion of the correlation 
surface in two opposite quadrants increases at the expense of that in the 
other pair of quadrants, and the total product will increase in magnitude 
to higher and higher values. But that is not all! When the association 
between x and y is of a negative character, the ellipse must traverse the 
surface in a direction at right angles to that given in Fig. 22, and accord- 
ingly the sign of the total product will become negative while its absolute 


FIGURE 22 

Progressive Change in the Total Product Moment, Σ(x − x̄)(y − ȳ), with 
Increasing Intensity of Correlation in Normal Surfaces 



magnitude remains unchanged. It follows directly, therefore, that the 
mean product of the deviates for normal surfaces is zero in the absence of 
correlation, and advances in magnitude away from this point progres- 
sively in a positive or negative direction as correlation of an increasingly 
positive or negative character is encountered. 
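
The quadrant argument can be checked numerically. The following Python sketch uses three small sets of hypothetical paired values (they are illustrative only and are not data from the text); the function name is likewise an assumption made for the example.

```python
def mean_product_of_deviates(xs, ys):
    """Mean of (x - mean_x) * (y - mean_y) over all pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical pairs, one point in each quadrant: the products cancel.
symmetric = mean_product_of_deviates([1, 1, -1, -1], [1, -1, 1, -1])

# Points concentrated in the upper-right and lower-left quadrants.
positive = mean_product_of_deviates([-2, -1, 1, 2], [-1, -2, 2, 1])

# The same x values with every y reversed in sign.
negative = mean_product_of_deviates([-2, -1, 1, 2], [1, 2, -2, -1])

print(symmetric, positive, negative)  # prints 0.0 2.0 -2.0
```

Reversing the sign of one variable leaves the absolute magnitude of the mean product unchanged and reverses its sign, as the geometric argument suggests.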

The limiting form in this transition of ellipses is given by the straight 
line in surface (F), representing perfect correlation between the two vari- 
ables. Now equation (1) clearly takes a minimum limiting value of zero 
for surface (A). Does it also assume a finite value for the other limit 
given by surface (F)? 
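For perfect correlation of the character defined, the paired relative 
deviates must equal one another. Therefore, 

$$\frac{\sum \left[\dfrac{x - \bar{x}}{s_x}\right]\left[\dfrac{y - \bar{y}}{s_y}\right]}{N} = \frac{\sum \left[\dfrac{x - \bar{x}}{s_x}\right]^2}{N} = \frac{1}{s_x^2}\cdot\frac{\sum (x - \bar{x})^2}{N} = \frac{s_x^2}{s_x^2} = 1.$$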

Let us designate this mean product of relative deviates as r. Then the 
lower and upper limiting values of the correlation scale are given by r 
when normal correlation is absent or perfect, respectively. Also, the 
inference that r moves between these limits progressively with smooth 
transition in the degree of correlation from nothing at all to perfection 
seems undoubtedly correct. It is more satisfying, of course, to have 
explicit proof of the correctness of this inference. Such proof is given in 
the Appendix to this chapter. 
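
As a numerical illustration of the two limits, the sketch below computes the mean product of relative deviates for hypothetical series in which y is an exact straight-line function of x. Standard deviations are taken with divisor N, as in the text; the data and function names are assumptions made for the example.

```python
import math


def relative_deviates(values):
    """Return (v - mean) / s for each value, with s computed using divisor N."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def mean_product_of_relative_deviates(xs, ys):
    """The quantity designated r in the text."""
    kx = relative_deviates(xs)
    ky = relative_deviates(ys)
    return sum(a * b for a, b in zip(kx, ky)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]              # hypothetical values
rising = [10 + 3 * x for x in xs]           # perfect positive dependence
falling = [10 - 3 * x for x in xs]          # perfect negative dependence

r_rising = mean_product_of_relative_deviates(xs, rising)
r_falling = mean_product_of_relative_deviates(xs, falling)
print(round(r_rising, 6), round(r_falling, 6))
```

Up to floating-point rounding, the two results should be plus one and minus one respectively, since within every pair the relative deviates are equal in magnitude.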

When Galton first arrived at an equivalent form of the quantity which 
we have symbolized as r, he named it the “coefficient of reversion.” 
Although he recognized fully that he was analyzing the system of normal 
surfaces of various degrees of correlation, he was thinking primarily in 
terms of his problem in natural inheritance; he had not at that time 
grasped the more general applicability of the coefficient he had devised. 
Its magnitude provided a measure of the extent to which offspring tended 
to regress toward mediocrity, and that was the immediate problem. The 
symbol r for this measure of reversion was selected by him and has been 
retained from that time. For some years it was known as “Galton's 
coefficient,” but now the more general term “coefficient of correlation” 
is commonly applied to it. Galton's reasoning leading to its establishment 
followed an entirely different path from that which we have 
followed, Karl Pearson publishing in 1895 the formula 

$$r_{xy} = \frac{\sum k_x k_y}{N} = \frac{1}{s_x s_y}\left[\frac{\sum (x - \bar{x})(y - \bar{y})}{N}\right]. \tag{2}$$

CALCULATION OF THE CORRELATION COEFFICIENT 

It would be a tedious matter to compute a correlation coefficient using 
either of the above formulas without any modification. Deviates are 
numerically awkward quantities, and relative deviates would in addition 
be quite a nuisance to calculate. Transformation of the equations to 
variate form, or “moments about zero” instead of “moments about the 
means,” is quite simple and provides a much more satisfactory formula 
for calculation purposes. The steps in algebraic rearrangement are as 
follows: 

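$$\begin{aligned}
\frac{\sum (x - \bar{x})(y - \bar{y})}{N} &= \frac{\sum (xy - \bar{x}y - x\bar{y} + \bar{x}\bar{y})}{N} \\
&= \frac{\sum xy}{N} - \bar{x}\,\frac{\sum y}{N} - \bar{y}\,\frac{\sum x}{N} + \bar{x}\bar{y} \\
&= \frac{\sum xy}{N} - \bar{x}\bar{y}.
\end{aligned}$$

Therefore, 

$$r_{xy} = \frac{\dfrac{\sum xy}{N} - \bar{x}\bar{y}}{s_x s_y}. \tag{3}$$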


This formula, suggested by Harris,¹ is admirably adapted to practical 
calculation needs. 
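
The equivalence of the deviate form and the variate form of equation (3) may be verified on any small series. The sketch below uses five hypothetical pairs chosen so that the arithmetic is easy to follow by hand; the function names are assumptions made for the example.

```python
import math


def r_from_deviates(xs, ys):
    """Mean product of deviates divided by s_x * s_y (moments about the means)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    mean_product = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return mean_product / (sx * sy)


def r_from_moments_about_zero(xs, ys):
    """Equation (3): only the sums of x, y, x^2, y^2 and xy are required."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum(x * x for x in xs) / n - mx * mx)
    sy = math.sqrt(sum(y * y for y in ys) / n - my * my)
    return (sum(x * y for x, y in zip(xs, ys)) / n - mx * my) / (sx * sy)


xs = [2, 4, 6, 8, 10]   # hypothetical values: mean 6, variance 8
ys = [1, 3, 2, 5, 4]    # hypothetical values: mean 3, variance 2

r_dev = r_from_deviates(xs, ys)
r_zero = r_from_moments_about_zero(xs, ys)
print(round(r_dev, 6), round(r_zero, 6))  # both 0.8
```

Here the mean product of deviates is 3.2 and the product of the standard deviations is 4, so both routes give 0.8.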


(A) Calculation without Seriation 

Equation (3) above permits of expeditious computation of the correla- 
tion coefficient without formation of a correlation table. The necessary 
summations may be proceeded with by machine directly from the original 
data. Illustration of this may be given in terms of the data presented as 
Table 13, which shows the records² of body weight and oxygen consumption 
under basal conditions for 136 men. The coefficient of correlation 
between the two variables may be determined directly from these data 
by means of equation (3). The necessary summations of x, x², y, y², and 
xy, together with the subsequent calculations, are assembled in Table 14, 
which incidentally reproduces an arrangement of a work sheet which may 
appeal to the reader as a suitable standard form. 

¹ J. Arthur Harris. The arithmetic of the product moment method of calculating 
the coefficient of correlation. Am. Nat., 44: 693-699. 1910. 

² J. Arthur Harris and Francis G. Benedict. A biometric study of basal metabolism 
in man. Carnegie Institution of Washington. 1919. 

TABLE 13 

Data for Study of the Relationship between Weight of Man and Oxygen 
Consumption under Basal Conditions 

(Data from Harris and Benedict²) 

x = body weight in kilograms. 
y = oxygen consumption in cubic centimeters per minute. 
N = 136. 

     x    y        x    y        x    y        x    y        x    y
  74.0  267     88.5  289     79.0  276     78.9  302     82.1  291
  82.2  282    108.9  361     62.4  259     66.0  242     74.0  270
  71.2  259     73.9  264     63.5  228     63.6  238     56.7  218
  56.3  223     75.0  242     64.7  217     60.0  219     59.2  231
  58.2  200     50.0  165     49.3  198     55.2  193     50.6  213
  55.4  224     59.3  211     85.8  265     83.1  258     83.0  237
  82.1  236     78.0  262     75.9  273     75.0  263     74.5  269
  74.4  244     74.2  256     73.7  218     73.1  261     73.0  222
  71.1  244     69.1  234     68.4  219     67.9  280     66.4  238
  66.0  241     65.1  222     65.0  227     64.0  240     63.7  237
  63.2  216     62.7  227     62.6  220     62.3  213     62.0  239
  60.8  209     60.6  225     60.5  250     60.5  244     60.4  229
  60.4  223     60.3  214     60.1  251     60.1  233     60.0  229
  59.8  245     59.7  251     58.5  193     58.0  233     57.8  192
  57.8  213     57.2  232     57.1  220     57.1  216     56.8  244
  56.7  238     56.3  232     56.1  219     55.1  203     54.9  232
  54.3  233     53.9  207     53.6  210     52.2  219     51.4  234
  50.2  203     49.3  229     48.6  185     46.3  173     92.0  273
  77.4  294     73.3  247     71.3  265     70.0  269     69.7  285
  69.5  248     68.2  237     67.2  247     66.7  228     66.5  261
  65.8  265     64.7  247     64.4  226     64.3  240     61.4  197
  59.8  208     58.2  233     57.6  208     57.4  215     56.9  215
  56.4  201     55.9  209     54.5  207     54.0  206     49.7  208
  49.2  187     33.2  142     83.1  284     74.0  225     71.9  225
  69.7  280     68.6  258     68.2  236     67.3  240     65.3  234
  63.6  289     61.6  233     61.4  225     60.5  210     58.9  208
  57.9  220     55.0  211     53.4  211     74.8  227     60.6  225
  60.3  203
The above procedure of direct summation from basic records without 
seriation obviously depends for its usefulness on the total frequency being 
sufficiently small to make seriation hardly worth while as a means of 
shortening calculation. A machine for accumulating products is, of 
course, essential. A scatter diagram or correlation table might well have 
to be prepared as an adjunct, however, if it is desired to examine the form 
of the bivariate distribution and the nature of the average dependence 
lines. In some situations, of course, one knows well enough from past 
experience with the variables concerned that the frequency surface is 
reasonably normal in form. Under such conditions formation of a correlation 
table may be unnecessary and the above procedure may be followed. 

TABLE 14 

Calculation Sheet for the Coefficient of Correlation 
between the Variables of Table 13 

x = body weight. 
y = oxygen consumption. 
N = 136. 
Key: H. and B. Adult men. 

Σx    = 8717.1            Σy    = 31,818           Σxy        = 2,072,134.9
x̄     = 64.096            ȳ     = 233.96           Σxy/N      = 15,236.28
Σx²   = 573,149.09        Σy²   = 7,561,352        Σxy/N − x̄ȳ = 240.574
Σx²/N = 4,214.332         Σy²/N = 55,598.18        s_x s_y    = 302.412
s_x²  = 105.993           s_y²  = 862.82           r_xy       = +0.796
s_x   = 10.295            s_y   = 29.37
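
The arithmetic of the Table 14 work sheet can be retraced in a few lines. The sketch below starts from the five totals recorded there and applies equation (3); the variable names are chosen for the example and are not part of the original work sheet.

```python
import math

# Totals transcribed from the Table 14 work sheet (N = 136 men).
N = 136
sum_x, sum_x2 = 8717.1, 573149.09
sum_y, sum_y2 = 31818.0, 7561352.0
sum_xy = 2072134.9

mean_x = sum_x / N
mean_y = sum_y / N

# Standard deviations from moments about zero (divisor N, as in the text).
s_x = math.sqrt(sum_x2 / N - mean_x ** 2)
s_y = math.sqrt(sum_y2 / N - mean_y ** 2)

# Equation (3): mean product about zero, less the product of the means.
mean_product = sum_xy / N - mean_x * mean_y
r_xy = mean_product / (s_x * s_y)

print(round(mean_x, 3), round(s_x, 3))
print(round(mean_y, 2), round(s_y, 2))
print(round(mean_product, 3), round(r_xy, 3))
```

The printed values should agree with the work sheet to the number of figures shown there, ending with a coefficient close to +0.796.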

(B) Calculation from a Correlation Table 

To proceed to calculate a correlation coefficient as a measure of associ- 
ation without any knowledge of the type of bivariate distribution being 
dealt with is to court trouble. The correlation table provides a suitable 
basis of judgment as to the characteristics of the surface, at least as far as 
reasonable normality of distribution is concerned. Preparation of such a 
table consumes a little time, but this will be offset in general by speedier 
calculation procedure if the total frequency is very much in excess of 100, 
and particularly if the variate magnitudes run to several digits. The 
data of Table 13, for instance, might well be arranged to advantage into a 
correlation table by an inexperienced computer. When it is recalled that 
seriation with uniform grouping opens the door at once to the use of code 
scales, enabling quick computation even without a calculating machine, 
the advantages of the correlation table become considerably enhanced. 

TABLE 15 

A Seriation with Broad Grouping for the Data of Table 13 

                         x class:   30-    40-    50-    60-    70-    80-    90-   100-
                                   39.9   49.9   59.9   69.9   79.9   89.9   99.9  109.9
                         x code:      0      1      2      3      4      5      6      7
 y class   y code     f       f:      1      6     43     52     24      8      1      1
 360-379     11       1               .      .      .      .      .      .      .      1
 340-359     10       .               .      .      .      .      .      .      .      .
 320-339      9       .               .      .      .      .      .      .      .      .
 300-319      8       1               .      .      .      .      1      .      .      .
 280-299      7       9               .      .      .      4      1      4      .      .
 260-279      6      15               .      .      .      2     11      1      1      .
 240-259      5      23               .      .      3     13      6      1      .      .
 220-239      4      43               .      1     13     23      4      2      .      .
 200-219      3      34               .      1     23      9      1      .      .      .
 180-199      2       7               .      3      3      1      .      .      .      .
 160-179      1       2               .      1      1      .      .      .      .      .
 140-159      0       1               1      .      .      .      .      .      .      .
 Total              136

A correlation table may be prepared expeditiously from record cards 
once the grouping scheme for each variable is decided. The cards may 
be seriated first for one variable, then reseriated for the second variable 
within the classes of the first. The frequencies may be entered progressively 
in a blank correlation table previously prepared. The seriations 
should always be checked at each step, then finally verified by reseriating 
the variables in reverse order. In the absence of cards the tally-stroke 
method* of Table 5 may be used, tallying directly into the 
appropriate “cells” of the correlation table. 

EXPLANATION OF TABLE 16 

In Table 16 the extension columns (3) and (4), also (9) and (10), provide 
the sums within classes for the first and second moments for x and y. 
Their totals are transferred as entries in Table 17. 

Columns (5) and (11) are formed in each case from accumulation of 
the products of the cell frequencies in the array and the variable scale 
values parallel to it. It gives the sum of the x values for all individuals 
having a fixed y value, and vice versa. Designations of the type are 
therefore used. By way of illustration, the vertical array of total frequency 
8 may be located in the correlation table. All these individuals 
have a body-weight value of 5 units on the code scale, or 84.95 kilograms in 
original units of measurement. Their total oxygen consumption in code 
scale units is (4 × 7) + (1 × 6) + (1 × 5) + (2 × 4), or 47 units as 
entered in column (11). Cell entries in column (11) give such totals. 
Those in column (5) give the total body weight for each class of oxygen 
consumption. The totals of these columns, (5) and (11), must agree with 
the totals of columns (9) and (3) respectively if the calculations are correct, 
as may readily be seen by considering just what the totals describe. 

Multiplication of each entry in column (5) with the corresponding 
entry in column (2) yields the total product of x and y values for the 
array of reference in the correlation table. These products form column 
(6), and the analogous products working the other way across the table 
form column (12). The sum of column (6), giving the total “product 
moment” for the whole surface, should, of course, be identical with the 
sum of column (12). The duplicate result provides an independent 
check. This total is transferred to Table 17 as Σxy. 

If a calculating machine is being used, columns (4), (10), (6), and (12) 
need not be included in the table extensions. The sum of each of these 
columns is all that is needed, and each may be accumulated directly on 
the machine from the contributory multiplications. This reduces the 
extension to the correlation table to three columns in each direction, the 
first of which is written in directly, leaving only the other two each way to 
be calculated. 

This systematic procedure in securing the summations seems to the 
writer to be elegant in its simplicity and in its provision of checks. It 
may be mastered in a single trial, and completely independent verifications 
are available for Σx, Σy, and Σxy. Moreover, if the entries in 
column (5) are divided by those in column (1), the mean values of x for 
the successive classes of y are available for examination of the form of 
average dependence. Likewise the quotients of entries in columns (11) 
and (7) give the average dependence of y on x. 

A seriation of the data of Table 13 into very broad groups is given in 
Table 15. Code scales have been inserted directly alongside the mar- 
ginal frequency columns. If the correlation coefficient between the 
two variables is determined now through use of the code scales, what 
transformation equation will be necessary to identify the coefficient with 
the original scales? The reader may note immediately that, unlike the 
mean and standard deviation, the correlation coefficient is a pure number 
on a universal scale and not a quantity on the scale of any x or y variable. 
The correlation between x and y is surely not affected by our coding 
scheme! 

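$$\frac{x - \bar{x}}{s_x} = \frac{h(X - \bar{X})}{h\,s_X} = \frac{X - \bar{X}}{s_X},$$

where X denotes the code-scale value corresponding to x, and h the class 
range. The relative deviate corresponding to each variate magnitude is just the 
same on the code scale as on the original scale. Therefore, 

$$\frac{\sum k_x k_y}{N} = \frac{\sum k_X k_Y}{N}$$

and 

$$r_{xy} = r_{XY} = \frac{\dfrac{\sum XY}{N} - \bar{X}\bar{Y}}{s_X s_Y}. \tag{4}$$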
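
That coding leaves the coefficient untouched is easily confirmed by computation. In the sketch below a few hypothetical code-scale pairs are decoded by a linear rule of the form x = a + hX; the particular constants (34.95 and 10 for x, 149.5 and 20 for y) are assumed here as the midpoints of the lowest classes and the class ranges of the broad grouping, and the data themselves are invented for the illustration.

```python
import math


def correlation(xs, ys):
    """Equation (3) of the text, with divisor N throughout."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum(x * x for x in xs) / n - mx * mx)
    sy = math.sqrt(sum(y * y for y in ys) / n - my * my)
    return (sum(x * y for x, y in zip(xs, ys)) / n - mx * my) / (sx * sy)


# Hypothetical code-scale values for six paired observations.
code_x = [0, 1, 2, 3, 4, 2]
code_y = [1, 0, 2, 4, 3, 2]

# Decode to original units: x = a + h * X (assumed constants, see above).
true_x = [34.95 + 10 * X for X in code_x]
true_y = [149.5 + 20 * Y for Y in code_y]

r_code = correlation(code_x, code_y)
r_true = correlation(true_x, true_y)
print(round(r_code, 6), round(r_true, 6))
```

The two printed values should coincide, whereas the means and standard deviations of the decoded series differ from those on the code scales and must be transformed back.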


Five total “moments about zero” need to be derived from the correlation 
table in application of this formula. They are: Σx, Σy, Σx², Σy², 
and the total “product moment” Σxy. Table 16 sets forth all the needed 
steps in securing these total moments. The work sheet giving the derived 
statistics on the code scales, with final transformation to the true scales, 
is reproduced as Table 17. 
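
The five total moments may also be accumulated directly from the cell frequencies of Table 15. The sketch below stores the table as a nested mapping from y code to x code to frequency; that layout, the variable names, and the decoding constants (class midpoints 34.95 and 149.5, class ranges 10 and 20) are assumptions made for the example rather than part of the original work sheet.

```python
import math

# Cell frequencies of Table 15: {y code: {x code: frequency}}.
cells = {
    11: {7: 1},
    8: {4: 1},
    7: {3: 4, 4: 1, 5: 4},
    6: {3: 2, 4: 11, 5: 1, 6: 1},
    5: {2: 3, 3: 13, 4: 6, 5: 1},
    4: {1: 1, 2: 13, 3: 23, 4: 4, 5: 2},
    3: {1: 1, 2: 23, 3: 9, 4: 1},
    2: {1: 3, 2: 3, 3: 1},
    1: {1: 1, 2: 1},
    0: {0: 1},
}

# Flatten to (X, Y, f) triples so that each total is a single summation.
triples = [(X, Y, f) for Y, row in cells.items() for X, f in row.items()]

N = sum(f for _, _, f in triples)
sum_X = sum(f * X for X, _, f in triples)
sum_X2 = sum(f * X * X for X, _, f in triples)
sum_Y = sum(f * Y for _, Y, f in triples)
sum_Y2 = sum(f * Y * Y for _, Y, f in triples)
sum_XY = sum(f * X * Y for X, Y, f in triples)

mean_X, mean_Y = sum_X / N, sum_Y / N
s_X = math.sqrt(sum_X2 / N - mean_X ** 2)
s_Y = math.sqrt(sum_Y2 / N - mean_Y ** 2)
r = (sum_XY / N - mean_X * mean_Y) / (s_X * s_Y)

# Return to original units (assumed midpoints and class ranges).
mean_x, s_x = 34.95 + 10 * mean_X, 10 * s_X
mean_y, s_y = 149.5 + 20 * mean_Y, 20 * s_Y

print(N, sum_X, sum_X2, sum_Y, sum_Y2, sum_XY)
print(round(mean_x, 3), round(s_x, 3), round(mean_y, 3), round(s_y, 3), round(r, 3))
```

The first line of output gives the totals entered in Table 17, and the second gives the decoded means and standard deviations together with the coefficient, which needs no decoding.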


* Vide page 22. 
TABLE 17 

Calculation Sheet for Determining the Coefficient of Correlation 
from the Data of Table 16 

x = body weight (in kilograms). 
y = oxygen consumption (in cubic centimeters per minute). 
N = 136. 
Key: H. and B. Adult men. 
(Capital letters denote values on the code scales.) 

ΣX    = 397               ΣY    = 577              ΣXY        = 1846
X̄     = 2.9191            Ȳ     = 4.2426           ΣXY/N      = 13.5735
x̄     = 64.141            ȳ     = 234.353          ΣXY/N − X̄Ȳ = 1.1887
ΣX²   = 1315              ΣY²   = 2765             s_X s_Y    = 1.6357
ΣX²/N = 9.6691            ΣY²/N = 20.3309          r_xy       = +0.727
s_X²  = 1.1479            s_Y²  = 2.3308
s_X   = 1.0714            s_Y   = 1.5267
s_x   = 10.714            s_y   = 30.534


GROUPING EFFECTS 

Seriation into the very broad classes selected in Table 15 was made 
primarily for the purpose of economizing space. It serves incidentally to 
demonstrate the effects of grouping on statistics. It will be noted that 
the correlation coefficient derived from the correlation table is +0.727, 
whereas that secured from the ungrouped data is +0.796. This discrepancy 
is not due to coding but is a result of using very coarse grouping 
in the table. The values for the means, the standard deviations, and the 
correlation coefficient in Tables 14 and 17 are assembled in Table 17a. 
It will be observed that, though all show some shift in magnitude, the 
means are affected very little. Indeed, the shift in the means is quite 
inconsequential. On the other hand, the standard deviations are in- 
creased approximately 4 per cent, and the correlation coefficient is 
decreased nearly 10 per cent. 

In general it may be shown that grouping introduces no bias or other 
effect of consequence on the mean, but that it does raise the standard 
deviation, this increase causing the decrease in the correlation coefficient. 
A correction factor for adjustment of the standard deviation to allow for 
this grouping effect has been developed. It is known as “Sheppard’s 
correction.” Its use calls for discriminating judgment of a sort rather 
beyond attainment by means of elementary discussion, and the writer 
will therefore not attempt to present it. Situations that commend broad 
grouping are in general situations that will benefit rather than suffer any 
loss through the standard deviation being somewhat higher and the correlation 
coefficient somewhat lower than the values to which finer classification 
would lead. It is better to establish a grouping that is reasonably 
fine than to apply any theoretical adjustment to the correlation coefficient. 
Grouping gives a slight conservative bias which is not without its 
advantages. 

TABLE 17a 

Comparison of Statistics Derived from Grouped and Ungrouped Data 
(Statistics from Tables 14 and 17) 

Statistic    Value without    Value with          Grouping effect
             grouping         coarse grouping
x̄            64.10            64.19               Increase 0.1 per cent
ȳ            234.0            234.9               Increase 0.4 per cent
s_x          10.30            10.71               Increase 4.0 per cent
s_y          29.37            30.53               Increase 3.9 per cent
r_xy         +0.796           +0.727              Decrease 9.5 per cent


THE NORMAL BIVARIATE FREQUENCY DISTRIBUTION 

The correlation coefficient assumes its maximum utility as a descrip- 
tive statistic when the bivariate frequency distribution for which it is 
calculated is normal in form. Indeed, it is only under such circumstances 
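that r_xy, when added to knowledge of x̄, ȳ, s_x, and s_y, completes description 
of the association in all its details. This may be appreciated through 
inspection of the equation for the normal surface: 

$$w = C e^{-K},$$

wherein w = the ordinate at any point (x, y), 

$$C = \left(2\pi s_x s_y \sqrt{1 - r_{xy}^2}\right)^{-1}, \quad \text{and}$$

$$K = \frac{1}{2(1 - r_{xy}^2)}\left[\frac{(x - \bar{x})^2}{s_x^2} - 2r_{xy}\,\frac{(x - \bar{x})(y - \bar{y})}{s_x s_y} + \frac{(y - \bar{y})^2}{s_y^2}\right]. \tag{5}$$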

One is not interested here in this rather complicated equation beyond 
noting that the two means, the two standard deviations, and the correlation 
coefficient collectively define a normal surface. The correlation 
coefficient may be regarded as the link which unites two univariate 
normal distributions to give the surface of joint association of those two 
variables, provided that surface also is normal. Calculation of the correlation 
coefficient for any surface which is not normal must at best give only 
a partial description of the surface. Immensely useful as the correlation 
coefficient is when properly applied, it is by no means a universally com- 
parable measure of association, not even when the average dependence 
is rectilinear. Biologists would do well to learn that this coefficient taken 
alone does not fully describe any association. Before embarking on 
discussion of such matters, one may well study further the trend lines, 
and the variation about them, which characterize the normal surface. 
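
To see how the five statistics jointly fix the surface, the ordinate may be evaluated at a few points. The sketch below assumes the surface is scaled to unit total volume (multiply by N for a frequency surface); the function name and the trial points are hypothetical, and the statistics are the rounded values of Table 14.

```python
import math


def normal_surface(x, y, mean_x, mean_y, s_x, s_y, r):
    """Ordinate w = C * exp(-K) of a normal bivariate surface of unit volume."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and +1")
    u = (x - mean_x) / s_x          # relative deviate of x
    v = (y - mean_y) / s_y          # relative deviate of y
    C = 1.0 / (2.0 * math.pi * s_x * s_y * math.sqrt(1.0 - r * r))
    K = (u * u - 2.0 * r * u * v + v * v) / (2.0 * (1.0 - r * r))
    return C * math.exp(-K)


# Rounded statistics from Table 14.
stats = dict(mean_x=64.096, mean_y=233.96, s_x=10.295, s_y=29.37, r=0.796)

peak = normal_surface(64.096, 233.96, **stats)       # at the two means
same_side = normal_surface(74.0, 260.0, **stats)     # both variates above average
opposite = normal_surface(74.0, 208.0, **stats)      # one above, one below

print(peak > same_side > opposite)  # prints True
```

With a positive coefficient the ordinate falls away far more slowly along the direction in which both variates rise together than across it, which is the elongation of the elliptical contours.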



APPENDIX TO CHAPTER 7 

Proof that r_xy May Take Values Only Between the Limits of 
Plus One and Minus One 

Bivariate frequency systems in which the paired values of x and y have 
precisely the same relative deviate values are systems of perfect straight-line 
correlation between the two variables. All other systems must fall into either 
of the following categories: 

(1) All (x,y) points fall on a line which is curved. This is perfect correlation 
of a curvilinear character. 

(2) The (x,y) points are scattered about some trend line, either straight 
or curved. This represents imperfect correlation of some form. 

In either of these situations the relative deviates of the paired x and y values 
cannot be identical. The greater the departure from perfect rectilinear corre- 
lation, the more in general will the paired relative deviates tend to differ from 
one another. 

Let d designate the difference between each pair of relative deviates in a 
fixed sense. Then d may be defined algebraically as 



= hx — 


The mean value of d, however, will always be zero, for the mean value of each 
complete set of relative deviates is zero. 

5 ~ ^ 0 

N ~ 'Y ' 

From this it follows directly that, the greater the individual values of d tend to
become, some being negative and some positive, the greater the variation in d.
Measuring this variation by the standard deviation of d, we have

$$s_d^2 = \frac{\sum (k_x - k_y)^2}{N} = \frac{\sum k_x^2}{N} - \frac{2\sum k_x k_y}{N} + \frac{\sum k_y^2}{N}.$$

But the mean square of any complete set of relative deviates is itself the squared
standard deviation of the set, which is always unity. Also, the mean product
of any complete set of paired relative deviates is $r_{xy}$. Therefore,

$$s_d^2 = 1 - 2r_{xy} + 1 = 2(1 - r_{xy}).$$

Rearranging this equation, it follows that

$$r_{xy} = 1 - \frac{s_d^2}{2}.$$

Since $s_d^2$ is zero only when x and y are perfectly correlated in straight-line form,
and under all other conditions $s_d^2$ is a real positive quantity because d then
becomes a variable quantity, $r_{xy}$ must always be less than unity under both
conditions (1) and (2) above.

The sign of a correlation coefficient is quite independent of the numerical
value of the coefficient. It may be reversed at will by reversing the signs of the
relative deviates of one variable throughout. Therefore it follows also that
$r_{xy}$ cannot be less than minus one.

The numerical scale of r for curvilinear systems may readily be inferred from 
the above to be from zero to some value less than one. That latter limiting 
value may be demonstrated by geometric drawings to fall increasingly below 
unity as curvilinearity increases. Indeed, the upper limit may itself be zero 
under certain conditions, regardless of the intensity of adherence of the scatter 
to the line. Thus $r$ has but little meaning as applied to a curvilinear system
unless the upper limit of r for the system is established. For rectilinear sys- 
tems, the upper limit is, of course, known to be unity. 
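
The identity at the heart of this proof can be checked numerically. The following is a minimal sketch in Python, not part of the original text; the paired measurements are invented for illustration, and the helper name `relative_deviates` is an assumption. It forms the relative deviates, takes their differences d, and compares the mean square of d with 2(1 - r).

```python
import math


def relative_deviates(values):
    """Express each value as (value - mean) / standard deviation (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


# Hypothetical paired measurements, invented purely for illustration.
xs = [61.0, 64.0, 66.0, 67.0, 70.0, 72.0]
ys = [60.0, 62.0, 65.0, 63.0, 66.0, 69.0]

kx = relative_deviates(xs)
ky = relative_deviates(ys)
n = len(xs)

# r is the mean product of the paired relative deviates.
r = sum(a * b for a, b in zip(kx, ky)) / n

# d = k_x - k_y has zero mean, so its squared standard deviation is its mean square.
d = [a - b for a, b in zip(kx, ky)]
s_d_squared = sum(v ** 2 for v in d) / n

print(f"r = {r:.4f}")
print(f"s_d^2 = {s_d_squared:.4f}, 2(1 - r) = {2 * (1 - r):.4f}")

# Reversing the signs of one set of deviates reverses the sign of r.
r_reversed = sum(a * (-b) for a, b in zip(kx, ky)) / n
```

Because the mean square of d can never be negative, r computed this way cannot exceed one, and the sign reversal shows why it cannot fall below minus one.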
RECTILINEAR REGRESSION

It having been recognized that certain variables tend to move 
together in their magnitudes, or are correlated in their variations, there 
is no more natural reaction than to seek understanding of what may be 
expected for the magnitude of one of them when any magnitude of the 
other is specified. Such understanding is comparatively easy of achieve- 
ment when the variables are very highly correlated and are either sus- 
ceptible of experimental control or assume values which, in the natural 
course of events, cover a tangible range of measurement. In such cases 
a simple plot of associated values for the two variables immediately 
establishes a trend of linear form. The elementary gas laws of chemistry 
and the more obvious principles of mechanics owe their early formula- 
tion to just such procedures of analysis on the part of scientists who were 
obliged to work with relatively crude measuring devices. Deviations 
from perfectly linear interrelationships were ascribed to errors of meas- 
urement, and the selected graduating lines were accepted as reflections of 
the true relationships. It has since been learned with more sensitive 
measuring devices that such deviations are not always entirely errors of 
measurement, that the correlations are not in fact as perfect as the first 
statements of the laws supposed, and that residual variations may often 
be accounted for through consideration of other variables whose influence
was not at first recognized. The ideas of A “causing” B have been
replaced by others involving the concept of B being more or less highly 
correlated with A, but this has not detracted from the importance of the 
central trend line in each case. As far as A may be predicted from B 
alone, the trend line continues to provide the best estimate. 

Associated variations are in general much less perfect for biological 
variables than those commonly found in the fields of physics, chemistry, 
et cetera, with their highly controllable experimental situations. This is 
naturally so, for usually a great number of variables interact to pro- 
duce biological phenomena, the influence of only the more important and 
better understood of them being recognized. However, the consequence 
of very imperfect correlations does not forbid progress along the lines of 
determining what is the most likely value of the variable B so far as it 
may be determined from any magnitude of the variable A with which it 
is correlated. It is to these graduating lines for predicting one variable
from another that we shall now give attention. As before, analysis will 
be confined in this elementary study to the simplest situations, namely, 
those wherein the line itself is of the simplest type — a straight line. 

To go back to Fig. 17,¹ it will be noted that, although the correlation
between stature of husband and wife is far from perfect, on the average
there appears to be a straight-line increase in wife’s stature as husband’s
stature increases. The calculated averages of y in successive classes of
x were plotted and joined in that diagram to reveal this feature. In
analogous manner the averages of x for classes of y were determined and
plotted. Both these lines, together with their rectilinear graduations,

FIGURE 23

Regression Lines for the Data on Assortative Mating in Figure 17

[Figure not reproduced. Horizontal axis: stature of husband in inches, 60 to 74.]

are reproduced in Fig. 23 apart from the scatter of individual cases. 
Such lines are generally known among statisticians as regression lines, 
following the terminology used by Galton in his inheritance studies. 
Since division of a variable into classes for purposes of getting a succes- 
sion of averages to establish such a trend is a purely empirical procedure, 
the irregular line formed by joining the sequence of average points is 
referred to as an empirical regression line. Note, however, that though 
changes in the classification will change the irregularities, they will not 
alter the underlying trend in the data. That trend in the present illus- 
tration is most reasonably a straight line in each case. These graduat- 
ing lines may be designated for purposes of distinction as the theoretical 
regression lines. 


¹ Vide page 86.
REGRESSIONS FOR THE NORMAL SURFACE

In Fig. 24, normal surfaces of several successive degrees of correla- 
tion are represented by elliptical contours. In order that the surfaces 
may be directly comparable one to the other, they have been plotted in 
terms of the relative deviate as the unit of measurement throughout. 
That is, $s_x$ equals $s_y$ as a linear dimension in every diagram, or the “axes
of symmetry” of the ellipses are at 45 degrees to the axes of abscissas and 
ordinates. The correlation coefficient increases from zero in panel (A) 
to unity in panel (F). The rectilinear regressions have been drawn in 
each case, and it will be observed that they accord throughout with the 
mid-points of arrays established in the appropriate direction. 

FIGURE 24 

The Change in Slope of the Regression Lines for Normal Surfaces of
Progressively Greater Correlation

[Figure not reproduced. Panels (A) to (F), each with abscissa x. Solid line: regression of y on x. Dotted line: regression of x on y.]

From this geometric presentation it will be seen: 

(1) that straight regression lines apparently provide the correct 
type of graduation of array averages for normal surfaces; 

(2) that there must be two regression lines for every surface of 
imperfect correlation; 
(3) that these regression lines appear to intersect at the point
$(\bar{x}, \bar{y})$ for every surface; and

(4) that the slopes of the two lines on each surface, taken with 
respect to their appropriate axes, are identical. 

Closer study will suggest as a fifth feature that a simple relationship 
exists between the slope of each regression line and the magnitude of the 
correlation coefficient. From an inspection of panel (A) it will be clear 
that the average value of y for each value of x is the mean of y as a whole.
It is a fixed value and does not change with x. Also, the mean value of x
for each value of y is $\bar{x}$. The regression lines coincide with the axes of
means, and are without slope in each case. As the degree of correlation 
increases, so these regression lines revolve away from the axes of means 
and toward each other until, for perfect correlation, they coincide. 
The slope of each of these two coincident regression lines is now unity. 
The slope of each regression line varies from zero to unity under these 
conditions as r varies from zero to unity. Does the slope coincide with 
the correlation coefficient in every case? We must proceed to define 
these lines explicitly. 

Empirical regression lines for any given surface, normal or otherwise, 
may be established very readily through simple arithmetical procedures 
following classification. The method of securing the correlation coefficient
detailed in Table 16² places one in a position to calculate the
means of arrays directly, as is indicated in the explanatory legend facing
the table.* These means may then be plotted and joined to form the
empirical regression line as for the assortative mating data in Fig. 23. 
One may then judge whether or not a straight line will graduate the 
empirical averages satisfactorily. Let us assume that that judgment 
has been made for a particular case and the straight theoretical regres- 
sion line concept is accepted for the data in hand. It is then desired to 
fit a straight line to the data. It will now be established algebraically 
that the line of “best fit” will pass through the intersection of means 
with slope r when the relative deviates form the units of measurement. 
In so doing, acceptable principles for determining what comprises a 
“best-fitting” line may be established. 

THE DEPENDENT VARIABLE 

The definitive terms “dependent” and “independent,” as applied to 
variables, are most useful in discussing associations. Prior to undertaking
our special problem a word or two about these terms may be helpful.

² Vide page 100. * Vide page 101.

As applied in the present connection, they are of mathematical
origin, and have their customary meaning in the sense that, from a point 
of view of mathematical analysis of the association, one of the variables 
is being regarded as dependent for its magnitude on the other. The 
other variable is designated as independent by contrast. This purely 
mathematical usage of the terms should not be construed as meaning 
dependence in the practical sense. Stature of wife or husband may 
hardly be claimed to depend on that of the other member to the alliance. 
In the statistical analysis of the data in Fig. 17,⁴ however, we may choose
to regard one variable as a function of, or dependent upon, the other in 
one approach, and then equally well reverse the arrangement in a sec- 
ond approach to establish both regression lines. 

In some associations such as this, it is biologically sensible to consider 
either variable as a function of the other. More generally, however, one 
variable naturally forms the dependent one as far as logical analysis is 
concerned. The weight of the new-born infant may logically be regarded 
as a function of its mother’s weight, but the inverse arrangement is 
devoid of any such logical meaning. Regression analysis of biological 
associations usually proceeds for such reasons in one direction only. 
The mistake should not be made, however, of losing sight of the fact 
that any bivariate frequency surface not characterized by perfect cor- 
relation has two regression lines, each one serving for prediction in only 
one direction. 

REGRESSION IN TERMS OF RELATIVE DEVIATES 

The equation to any straight line may be written in the general form 

$$Y = a + bx, \tag{1}$$

where Y is the dependent variable and a and b are constants fixing the
position of the line. The capital letter is used for the dependent variable
here to characterize the theoretical value given by the equation, as
opposed to the observed values y found associated with any particular
value of the independent variable x. When x equals zero, Y equals
the constant a, and therefore a fixes a point through which the line
passes. Each unit of increase in x also causes b units of increase in Y.
Therefore the constant b fixes the slope of the line.

Let the contour in Fig. 25 designate a correlation surface to which it 
is desired to fit the regression of y on x as a straight line such as RL.
For purposes of simplicity in derivation of the constants for this line we 
shall find it convenient to transform to the relative deviate scales for the 


⁴ Vide page 86.

two variables. The transformation is quite comparable to that used in
our coding scheme. Bold-face type will now be used for all quantities 
referred to these scales. The line RL may then be defined by the equation

$$\mathbf{Y} = \mathbf{a} + \mathbf{b}\mathbf{x}, \tag{2}$$

the constants $\mathbf{a}$ and $\mathbf{b}$ being so chosen that the line is the “best-fitting”
straight line that may be inscribed on the surface for the regression of
y on x. Let the point k represent any individual dot in the swarm
which is represented as a whole by the elliptical contour. The coordinates
$\mathbf{x}$ and $\mathbf{y}$ will be appropriate to this general point k.


FIGURE 25 

Coordinates of Any Point k in a Normal Surface, Scaled in Relative
Deviates, in Relation to the Regression of y on x



The simplest requirement to be demanded of such a line of “best fit”
would be that the positive deviations of observed values from the line
should, in their totality, be equal to the negative deviations. In other 
words, the sum of all the deviations from the line shall be zero. If the 
line is to be a true moving average, then this condition must be fulfilled. 

Now

$$\sum(\mathbf{y} - \mathbf{Y}) = \sum[\mathbf{y} - (\mathbf{a} + \mathbf{b}\mathbf{x})] = \sum\mathbf{y} - N\mathbf{a} - \mathbf{b}\sum\mathbf{x}.$$

But $\sum\mathbf{y}$ and $\sum\mathbf{x}$ are both equal to zero. Since N is a real number, then
$\mathbf{a}$ must be made zero if the whole expression is to equal zero, thus fulfilling
the requirement of good fit. Returning to equation (2), it will be
observed that $\mathbf{a}$ is the value of $\mathbf{Y}$ when $\mathbf{x}$ is equal to zero, so that the
requirement merely demands that the line shall pass through the point
($\mathbf{x} = 0$, $\mathbf{Y} = 0$), which is the same point as $(\bar{x}, \bar{y})$ when referred to the
original scales of measurement.
Any line passing through the intersection of means fulfills the first
requirement of good fit. Of this infinity of lines there is, of course, just
one of “best fit,” and its slope $\mathbf{b}$ must now be fixed by some further
specification. It is necessary to define just what is meant by a “best- 
fitting” line. This line must surely be definable in some way by the 
statement that it is the one of closest approximation to the swarm; that 
is, it is the line from which the points in the swarm deviate the least. 
All the deviations y — Y must be considered positive values in this 
connection, which may be achieved either by disregarding their signs or 
by using the second powers instead of the first. The much greater 
mathematical simplicity of solution of such problems in general when 
squaring is involved rather than ignoring the signs has been commented 
on previously in these pages. Let us then express our requirement in 
this technical form: The line of “best fit” is that line from which the 
mean square deviation of the observed values is a minimum.

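Since $\mathbf{Y} = \mathbf{b}\mathbf{x}$, $\mathbf{a}$ having been set equal to zero,

$$\frac{\sum(\mathbf{y} - \mathbf{Y})^2}{N} = \frac{\sum(\mathbf{y} - \mathbf{b}\mathbf{x})^2}{N} = \frac{\sum\mathbf{y}^2}{N} - 2\mathbf{b}\,\frac{\sum\mathbf{x}\mathbf{y}}{N} + \mathbf{b}^2\,\frac{\sum\mathbf{x}^2}{N}.$$

But the mean square of the relative deviates, x and y, is unity in each
case. Also, the mean product of the relative deviates is $r_{xy}$. Therefore

$$\frac{\sum(\mathbf{y} - \mathbf{Y})^2}{N} = 1 - 2\mathbf{b}\,r_{xy} + \mathbf{b}^2.$$
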

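Adding and subtracting $r_{xy}^2$ to form a binomial, the quantity to be
minimized through selection of the appropriate value of $\mathbf{b}$ becomes

$$\frac{\sum(\mathbf{y} - \mathbf{Y})^2}{N} = 1 - r_{xy}^2 + r_{xy}^2 - 2\mathbf{b}\,r_{xy} + \mathbf{b}^2 = (1 - r_{xy}^2) + (r_{xy} - \mathbf{b})^2.$$
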

Clearly the objective will be achieved if $\mathbf{b}$ is set equal to $r_{xy}$, for the
squared term then vanishes and no other choice can make the sum smaller. The
regression line of “best fit by least squares” will pass through the intersection
of means with slope equal to the correlation coefficient, when the variables
are scaled in terms of relative deviates.

$$\mathbf{Y} = r_{xy}\,\mathbf{x}. \tag{3}$$

Also, the mean square deviation of the observed values about the regression
line under the same conditions is

$$\frac{\sum(\mathbf{y} - \mathbf{Y})^2}{N} = 1 - r_{xy}^2. \tag{4}$$
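
The two requirements of good fit may be illustrated numerically. The sketch below is an illustration only, not the author's procedure: the data are invented, and the function names `mean_square_deviation` and `total_deviation` are assumptions. It scales two variables in relative deviates, confirms that any line through the intersection of means has zero total deviation, and shows that the mean square deviation is least, and equal to 1 - r², when the slope equals r.

```python
import math


def relative_deviates(values):
    """Express each value as (value - mean) / standard deviation (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


# Hypothetical paired measurements, invented purely for illustration.
xs = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0]
ys = [3.0, 3.5, 6.0, 5.5, 9.0, 8.0]

x = relative_deviates(xs)
y = relative_deviates(ys)
n = len(x)
r = sum(a * b for a, b in zip(x, y)) / n


def mean_square_deviation(b):
    """Mean square of y - Y about the line Y = b * x (a set to zero)."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) / n


def total_deviation(b):
    """Sum of y - Y; zero for every slope, since the line passes through the means."""
    return sum(yi - b * xi for xi, yi in zip(x, y))


at_r = mean_square_deviation(r)
for step in (-0.2, -0.05, 0.05, 0.2):
    # Moving the slope away from r adds step**2 to the mean square deviation.
    print(f"b = r{step:+.2f}: {mean_square_deviation(r + step):.4f}")
print(f"b = r: {at_r:.4f}  (1 - r^2 = {1 - r * r:.4f})")
```

The printed values rise on either side of b = r by exactly the square of the departure, which is the binomial term of the derivation.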
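THE TRUE SCALE EQUATIONS

This regression equation may be transferred from the scale of relative
deviates to the original scales of measurement of the variables by
direct substitution.

$$\mathbf{Y} = r_{xy}\,\mathbf{x},$$

that is

$$\frac{Y - \bar{y}}{s_y} = r_{xy}\,\frac{x - \bar{x}}{s_x},$$

or

$$Y - \bar{y} = r_{xy}\,\frac{s_y}{s_x}\,(x - \bar{x}),$$

and

$$Y = \bar{y} - r_{xy}\,\frac{s_y}{s_x}\,\bar{x} + r_{xy}\,\frac{s_y}{s_x}\,x.$$

Thus

$$Y = a + bx,$$

where

$$b = r_{xy}\,\frac{s_y}{s_x}$$

and

$$a = \bar{y} - b\bar{x}. \tag{5}$$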


Similarly the rectilinear regression of x on y will be given by the
equations

$$X = a + by,$$

where

$$b = r_{xy}\,\frac{s_x}{s_y}$$

and

$$a = \bar{x} - b\bar{y}. \tag{6}$$


It is important to recognize in both sets of equations (5) and (6) that 
$r_{xy}$ is a quantity with sign; b will always be negative when r is negative,
and a will then become the sum of two numbers. 
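
Equations (5) and (6) share one form and differ only in which variable is treated as dependent. A small sketch follows, assuming invented summary statistics and a hypothetical helper named `regression_constants`; a negative r is chosen so that the remark on signs can be seen in the output.

```python
def regression_constants(mean_dep, sd_dep, mean_ind, sd_ind, r):
    """Return (a, b) for the line  dependent = a + b * independent."""
    b = r * sd_dep / sd_ind          # slope: r times s(dependent) / s(independent)
    a = mean_dep - b * mean_ind      # forces the line through the two means
    return a, b


# Hypothetical summary statistics, invented for illustration only.
x_bar, s_x = 50.0, 10.0
y_bar, s_y = 20.0, 4.0
r = -0.5

a_yx, b_yx = regression_constants(y_bar, s_y, x_bar, s_x, r)   # equations (5)
a_xy, b_xy = regression_constants(x_bar, s_x, y_bar, s_y, r)   # equations (6)

print(f"Y = {a_yx:.2f} + ({b_yx:.3f})x")
print(f"X = {a_xy:.2f} + ({b_xy:.3f})y")
```

With r negative both slopes are negative, and each constant a is the mean of the dependent variable increased by a positive amount. The product of the two slopes equals r², which is a convenient arithmetic check.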


THE REGRESSION COEFFICIENT 

The slope b of each regression line, expressing in the concrete units of
actual measurement the average change in the dependent variable for
each unit of change in the other, is known as the regression coefficient.
It is given in each case by the product of the correlation coefficient and 
the appropriate quotient of the standard deviations of the two variables, 
that of the dependent variable forming the numerator. 

Even more than the correlation coefficient, the regression coefficient 
is a statistic of great descriptive value. Whereas the former measures 
intensity of association on a relative scale, the latter measures average 
change in terms of the scales employed for the measurement of the
variables. The regression coefficient is dependent for its validity solely
on the existence of a straight-line average relationship of the dependent
to the independent variable.

FIGURE 26 

The Regression of Basal Metabolism on Body Weight for 136 Normal Men

[Figure not reproduced. Horizontal axis: body weight in kilograms, with the code scale also marked.]


CALCULATION PROCEDURE 

The computational steps in determining the constants for a rectilin- 
ear regression line are very simple when the correlation coefficient for 
the surface has previously been determined. For purposes of illustra- 
tion we reproduce in Fig. 26 the scatter diagram for the association of 
basal metabolism with weight of subject for 136 normal men (see Tables 
13 and 15 of previous chapter). One may well be concerned in ascer- 
taining from such data the most likely oxygen consumption for the suc- 
cessive weights. Calculations leading to the desired empirical and 
theoretical regression lines are given in Table 18 with its appended work 
sheet. The first three columns of the table, and the means, standard 
deviations, and correlation coefficient, are copied from Tables 16 and 17 

* For basic data, see Table 13, page 97. 
respectively. The very few steps to be taken in securing the regression
constants a and b may be noted in particular. The derived regression
lines are given in Fig. 26, wherein the set of empirical averages is plotted
by small circles which are joined by broken lines to indicate the continuity.
In order to draw the theoretical regression line it is necessary to


TABLE 18 

Calculations for the Regression of Oxygen Consumption on Body Weight
(Data from Tables 16 and 17)

Empirical regression

x = body weight.
y = oxygen consumption.

            Code scale values                        Original scale values
x           f           $\sum y_x$    $\bar{y}_x$    x           $\bar{y}_x$
0           1           0             0              35          150
1           6           14            2.33           45          197
2           43          143           3.33           55          217
3           52          226           4.35           65          237
4           24          130           5.42           75          258
5           8           47            5.88           85          268
6           1           6             6.00           95          270
7           1           11            11.00          105         370

Theoretical regression

$\bar{x} = 64.141$.          $\bar{y} = 234.353$.
$s_x = 10.714$.              $s_y = 30.534$.
$s_y / s_x = 2.850$.         $r_{xy} = +0.727$.

$b = r_{xy}\,s_y / s_x = 2.072$.

$a = \bar{y} - b\bar{x} = 101.45$.

$Y = 101.45 + 2.072x$.

When x = 35, then Y = 174.0.

When x = 105, then Y = 319.0.

plot two points determined by its equation. Greater accuracy in plot- 
ting this line will be secured by taking two points far apart. It is there- 
fore most useful to substitute in the equation a small and a large value of 
the independent variable, as has been done in Table 18. 

As a partial check on the accuracy of the computations, note that the 
line as drawn passes through the point $(\bar{x}, \bar{y})$, as is demanded by theory.
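
The work sheet of Table 18 can be followed step by step in a few lines of Python. This is a sketch for checking the arithmetic, not part of the original text. The summary statistics and array totals are those of the table, while the decoding constants (x = 35 + 10 × code, y = 150 + 20 × code) are inferred from the table's two scales and should be treated as an assumption.

```python
# Summary statistics as given in the Table 18 work sheet.
x_bar, s_x = 64.141, 10.714      # body weight
y_bar, s_y = 234.353, 30.534     # oxygen consumption
r = 0.727

b = round(r * s_y / s_x, 3)      # regression coefficient, rounded as in the table
a = round(y_bar - b * x_bar, 2)  # constant a, computed from the rounded slope


def theoretical(x):
    """Value Y on the theoretical regression line for body weight x."""
    return a + b * x


# Empirical regression: (code x, frequency f, sum of coded y in the array).
arrays = [(0, 1, 0), (1, 6, 14), (2, 43, 143), (3, 52, 226),
          (4, 24, 130), (5, 8, 47), (6, 1, 6), (7, 1, 11)]

# Decoding constants inferred from the table's columns (an assumption):
# x = 35 + 10 * code, y = 150 + 20 * code.
for code_x, f, sum_y in arrays:
    x = 35 + 10 * code_x
    empirical = 150 + 20 * (sum_y / f)
    print(f"x = {x:3d}  empirical mean = {empirical:6.1f}  line = {theoretical(x):6.1f}")

print(f"b = {b}, a = {a}")
```

Evaluating `theoretical(x_bar)` returns the mean of y to within rounding, which is the partial check described in the text.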


The agreement by visual inspection of the empirical and theoretical 
lines gives an approximate verification of the correctness of the slope of 
the theoretical line. Complete verification must, of course, be given 
by thorough checking of the arithmetic itself. 

Further illustration of the application of rectilinear regression lines 
may be given with data from the field of educational statistics. For this 
purpose we shall now study the correlation between the marks secured 
by 449 students in the “science” and “current affairs” tests already 

FIGURE 27 

Correlation Table, Regression Lines, and Equivalent Score Line (ES), for
Marks Secured on Two Parts of a Reflective Thinking Examination



referred to at the close of Chapter 6. In the earlier analysis these two 
distributions were related to one another for the purpose of defining 
scores of equivalent probability of occurrence. No consideration was 
given therein to the average mark actually secured in one test by the 
students receiving any given score in the other. One may now give 
attention to this problem. 

Figure 27 presents the bivariate frequency distribution for the scores 
of the 449 students in the two tests under consideration. The central 
line ES, which is drawn across the frequency table as a background, is 
that of equivalent scores from the point of view previously set forth. 
It reproduces the line of panel (C) in Fig. 16. The line ES may now be
observed to form the principal axis of the swarm forming the correlation
table. Defining the paired scores of equal relative deviates, it is the line
on which all points would fall if there happened to be perfect rectilinear
correlation between the scores secured on each test by the 449 examinees.

A glance at Fig. 27 is sufficient to show that such perfection in corre- 
lation is far from realized. What then is the most likely score on Part B 
for any given score on Part A, and vice versa? The averages of arrays 
suggest rectilinear graduations as satisfactory, and these are drawn as 
the solid lines in the figure. The broken lines parallel to the axes indicate
the mean values of each examination as a whole. The imperfect corre- 
lation between the two scores has resulted in the lines of average depend- 
ence revolving away from the line of equivalent scores, ES (or principal 
axis of symmetry of the surface), toward the axes of means. That is, 
the most likely score on one test for a given score on the other is found 
over the whole surface to regress or move back from the equivalent score 
toward the general average of all scores, or toward mediocrity. This 
phenomenon is, of course, a consequence of imperfect correlation,
the relative degree of regression varying inversely with the correlation
coefficient.
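
The regression toward mediocrity follows directly from equation (3). As a small illustration, assuming a hypothetical score two standard deviations above its mean, the sketch below prints the most likely relative deviate on the other test for several values of r, together with the amount by which it falls back toward the general average.

```python
def predicted_deviate(r, given_deviate):
    """Most likely relative deviate of one variable, from equation (3): Y = r * x."""
    return r * given_deviate


given = 2.0  # a score two standard deviations above its mean (hypothetical)
for r in (1.0, 0.8, 0.5, 0.2, 0.0):
    predicted = predicted_deviate(r, given)
    regression = given - predicted      # amount moved back toward the mean
    print(f"r = {r:.1f}: predicted = {predicted:.2f}, regression = {regression:.2f}")
```

Only with perfect correlation is the equivalent score itself the most likely score; with no correlation the most likely score is simply the general average.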

It was in studying the inheritance of stature in man that Galton 
first found that adult sons on the average attained statures intermediate 
between the father’s stature and the general population average. His 
correlation coefficient aimed to measure this degree of “reverting back” 
and he called it the “coefficient of reversion” as we have already noted. 
He also called the line tracing this average relationship on the scales of 
measurement the “regression line,” a term which has persisted quite 
appropriately, for there is always relative regression on the average 
toward mediocrity with imperfect correlation between variables. Translating
the phenomenon to the inheritance of intelligence, it becomes at
once a source of hope and despair respectively for the moron and the 
genius; their offspring will tend on the average to be intermediate 
between them and the general population level in intelligence, for the 
parent-offspring correlation in intelligence falls far short of perfection. 



CHAPTER 9 

RESIDUAL VARIATION 

While the regression equation permits of the determination of the 
most likely value of a dependent variable for any specified value of the 
other, it will be recognized that this is average expectancy only, based 
on a known scatter of individual cases. Just as the fixed value $\bar{y}$ is the
mean of a univariate frequency distribution, so the line defined by the
equation for Y traces the mean of a frequency distribution which is always
parallel to the y axis but which moves across the (x, y) plane. For
any particular value of an independent variable x, Y is the most likely
value of the dependent variable in the sense that it is the expected average
for any large experience. Many values of y will be observed to
be associated with this particular x, these y values differing among
themselves. In a scatter diagram these observed values will be represented
by dots scattered parallel to the y axis at the given x and about
the point Y on the regression line. The deviation of each of these
observed values from average expectancy may be measured; it is symbolically
defined in each case as $y - Y$. There will indeed be N values
of $y - Y$ for any given surface, and, since a primary requirement in fitting
the line was that the total deviation, $\sum(y - Y)$, should be zero,
the mean value of $y - Y$ is, of course, zero.

The amount and kind of variation of the observed values about the 
regression line are of great interest. Obviously, with high correlation 
the dispersion from the regression line is small, vanishing entirely when 
correlation is perfect. As the intensity of association decreases from this 
upper limit, the points scatter increasingly from the axis of equivalent 
values. The scatter finally reaches a maximum when correlation 
vanishes. It is this scattering which determines the extent to which 
individual values of the dependent variable may be expected to deviate 
from the most likely value given by the regression equation. 

CONSTANCY OF RESIDUAL VARIATION 

For the normal surface, and indeed not infrequently for other types 
of bivariate distribution, the scatter about each regression line is con- 
stant in amount and form from one end of the line to the other. Some 
appreciation of the feature of constancy of form may be secured from
inspection of Fig. 17, wherein it may be noted that the frequency distributions
of y within the successive classes of x are fairly symmetrically
disposed about the regression line as a central value in each case. These
distributions are indeed normal in form.


TABLE 19 

Calculation of “Root Mean Square Deviations” of y - Y within Classes
of x for the Data of Fig. 27

Class          Regression     Total squared deviation     Frequency     R. M. S. D. about
x              Y              $\sum(y - Y)^2$             f             regression line
45-49          15.3272        201.5526                    2             10.0387
50-54          16.4677        330.2957                    7             6.8691
55-59          17.6081        450.2102                    18            5.0012
60-64          18.7486        1,010.2245                  36            5.2973
65-69          19.8890        1,936.6981                  59            5.7293
70-74          21.0295        2,221.5082                  74            5.4791
75-79          22.1699        3,558.3835                  88            6.3589
80-84          23.3104        3,345.8714                  73            6.7701
85-89          24.4508        1,521.3338                  44            5.8801
90-94          25.5913        520.8516                    29            4.2380
95-99          26.7317        193.0290                    11            4.1890
100-104        27.8722        777.2805                    5             12.4682
105-109        29.0127        102.8645                    3             5.8556
All classes                   16,170.1037                 449           6.0011


Calculations given in Table 19 of the “root mean square deviation”¹
from the regression line for the data of Fig. 27 show a fairly
constant value for the amount of variation, the fluctuations being justly
ascribable to the sampling errors attending the small numbers of cases.
This feature of constancy of scatter about the regression line is commonly
designated by the technical term homoscedasticity (Greek:
homos, the same; skedasis, scattering). Homoscedastic normal distribution
about the regression lines characterizes normal surfaces in
general, Figs. 17 and 27 being fair examples.

¹ The standard deviation is a particular “root mean square deviation” in which
the deviations are taken from the mean. In our consideration herein, the deviations
are considered from the regression-line values, Y, which graduate the observed array
means by a straight line.
We have seen that the probability of occurrence of any specified
value in a normal distribution may be determined precisely if the mean
and standard deviation of the distribution are known. The regression
equation provides the anticipated mean value of the dependent variable
for any specified value of the independent variable. May the needed
standard deviation be calculated with like simplicity? Summing the
squared deviations for all classes of x as in the last line of Table 19,
dividing by N = 449, and extracting the square root leads to the value
6.0011 for the desired standard deviation. The reader will observe,
however, that these computations are quite tedious to perform. This is
not a simple method of valuing the general dispersion from the
regression line.
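
The pooling just described can be retraced from the columns of Table 19. The sketch below is offered only as a check on the arithmetic; the figures are transcribed from the table and the variable names are assumptions. It recomputes the root mean square deviation for each class of x and for all classes together.

```python
import math

# (class of x, total squared deviation of y from Y, frequency), from Table 19.
table_19 = [
    ("45-49", 201.5526, 2),
    ("50-54", 330.2957, 7),
    ("55-59", 450.2102, 18),
    ("60-64", 1010.2245, 36),
    ("65-69", 1936.6981, 59),
    ("70-74", 2221.5082, 74),
    ("75-79", 3558.3835, 88),
    ("80-84", 3345.8714, 73),
    ("85-89", 1521.3338, 44),
    ("90-94", 520.8516, 29),
    ("95-99", 193.0290, 11),
    ("100-104", 777.2805, 5),
    ("105-109", 102.8645, 3),
]

# Root mean square deviation within each class of x.
rmsd_by_class = {label: math.sqrt(ss / f) for label, ss, f in table_19}

# Pooled over all classes: sum the squared deviations, divide by N, take the root.
total_ss = sum(ss for _, ss, _ in table_19)
n = sum(f for _, _, f in table_19)
pooled_rmsd = math.sqrt(total_ss / n)

for label, value in rmsd_by_class.items():
    print(f"{label:>8}  {value:8.4f}")
print(f"All classes: N = {n}, R.M.S.D. = {pooled_rmsd:.4f}")
```

The largest departures from the pooled value occur in the end classes, where the frequencies are 2, 5 and 3, in keeping with the remark on sampling errors.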

FIGURE 28 

Frequency Curves of Total Variation, and Residual Variation about the
Regression Line, for a Normal Surface



THE RELATIONSHIP OF TO 

One may profitably turn at this point to algebraic analysis. For the
N values of x there are N associated y values and N pairs of derived
magnitudes, Y and $y - Y$. So far as y may be predicted from x, Y is
that value in each case. To this must be added an increment $y - Y$
which is not predictable from the value of x alone.² $y - Y$ is therefore
independent of x; it represents a residual part in the total magnitude of
each y value which can be accounted for only through consideration of
variables other than x. The fact that y is not perfectly correlated with
x indicates at once that factors other than those reflected in x are at play
in determining y. The scatter about the regression line portrays these

² Note that this increment may be positive or negative.

residuals, $y - Y$, which may be accumulated into a single distribution
by projection of the points of the swarm parallel to the regression line and
onto a y axis. An attempt is made to represent this distribution in Fig.
28, where a normal surface with correlation coefficient of +0.8 is represented
by a single contour. The normal curve on the left represents the
y distribution, and that on the right is the $y - Y$ distribution. Six
points from the swarm represented by the contour are projected to
indicate the source of each univariate frequency distribution.

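The mean value of the residuals $y - Y$ is always zero. However,
the variation about that mean value increases from nothing at all to a
maximum magnitude as the correlation coefficient passes from unity
down to zero. This variation may be measured as the standard deviation
of the residuals curve in Fig. 28. Now the squared standard deviation
of any distribution with zero mean is the mean square of the individuals.
That is,

$$s_{y-Y}^2 = \frac{\sum(y - Y)^2}{N}.$$

But

$$Y = \bar{y} + b(x - \bar{x}),$$

where

$$b = r_{xy}\,\frac{s_y}{s_x}.$$

Therefore

$$s_{y-Y}^2 = \frac{\sum\{y - [\bar{y} + b(x - \bar{x})]\}^2}{N}.$$

That is,

$$\begin{aligned}
s_{y-Y}^2 &= \frac{\sum[(y - \bar{y}) - b(x - \bar{x})]^2}{N} \\
&= \frac{\sum(y - \bar{y})^2}{N} - 2b\,\frac{\sum(x - \bar{x})(y - \bar{y})}{N} + b^2\,\frac{\sum(x - \bar{x})^2}{N} \\
&= s_y^2 - 2b\,r_{xy}\,s_x s_y + b^2 s_x^2 \\
&= s_y^2 - 2r_{xy}^2 s_y^2 + r_{xy}^2 s_y^2 \\
&= s_y^2\,(1 - r_{xy}^2).
\end{aligned}$$

$$s_{y-Y} = s_y\sqrt{1 - r_{xy}^2}. \tag{2}$$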


The tedious calculations of Table 19 are indeed unnecessary. The same 
result is given by a very simple equation. In Table 19 we found by long 
arithmetic calculation that $s_{y-Y} = 6.0011$. From equation (2), 


s,-r = 6.4820Vl - 0.92582 
= 6 . 0011 . 
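
The agreement between the long calculation and equation (2) may be checked numerically. The following is a minimal sketch in Python, assuming a small set of invented x and y values (they are not the data of Table 19) and using the divisor N throughout, as the text does. The function name `residual_sd_two_ways` is hypothetical.

```python
import math

def residual_sd_two_ways(xs, ys):
    """Return (direct, shortcut) values of the residual standard deviation."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Standard deviations and correlation with divisor N, as in the text.
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)
    b = r * sy / sx
    # Long way: form every residual y - Y and take the root mean square.
    residuals = [y - (my + b * (x - mx)) for x, y in zip(xs, ys)]
    direct = math.sqrt(sum(e ** 2 for e in residuals) / n)
    # Short way: equation (2).
    shortcut = sy * math.sqrt(1 - r ** 2)
    return direct, shortcut

# Invented illustrative data, not taken from the book.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]
direct, shortcut = residual_sd_two_ways(xs, ys)
print(direct, shortcut)  # the two values agree to rounding error
```

Any other set of paired values with some scatter could be substituted; the two results should coincide apart from floating-point rounding.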
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ERRORS OF ESTIMATE 

This standard deviation of what has so far been designated as the 
residual variation is often referred to as the “standard error of estimate.” 
For normal homoscedastic systems it defines the normal curve of error 
about the regression-line value which may be expected in predicting one 
variate magnitude from knowledge of the other. The “probable error 
of estimate” will, of course, be given by 0.6745 times $s_{y-Y}$. 

It is also common practice to use the subscript symbolism y.x in 
place of y - Y. This is a very useful convention, for 

$$s_{y-Y} = s_{y.x}$$

measures the variation in y which is independent of x; $s_{y.x}$ is the 
standard deviation of y for any fixed value of x. Thus a form of symbolism is 
introduced which defines all elements concerned and is readily extensible 
to other statistics as well as to any number of variables in multiple 
association. 

Just as Y defines the most likely value of y for any particular value 
of x, so $(Y \pm 0.6745\,s_{y.x})$ defines the range within which the central 50 
per cent of actually associated values of y may be expected to occur, 
provided the distribution about the regression line is characterized by 
normality and homoscedasticity. Similarly, only 5 per cent of the 
actual values will fall outside the range defined by $(Y \pm 1.96\,s_{y.x})$. 
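
The factors 0.6745 and 1.96 are simply points on the normal probability integral, and the zones follow from them directly. The sketch below assumes Python's standard `statistics.NormalDist`; the helper names `normal_factor` and `estimate_zone` and the regression-line value 40.0 are hypothetical, while 6.0011 is the standard error of estimate quoted earlier in the text.

```python
from statistics import NormalDist

def normal_factor(central_prob):
    """Multiplier k such that mean +/- k SD encloses the central share of a normal curve."""
    return NormalDist().inv_cdf(0.5 + central_prob / 2)

def estimate_zone(y_pred, s_yx, central_prob):
    """Zone about the regression-line value within which the stated share of y values is expected."""
    k = normal_factor(central_prob)
    return y_pred - k * s_yx, y_pred + k * s_yx

k50 = normal_factor(0.50)   # the "probable error" factor, about 0.6745
k95 = normal_factor(0.95)   # about 1.96

# Hypothetical regression-line value Y = 40.0 with s_y.x = 6.0011.
low, high = estimate_zone(40.0, 6.0011, 0.95)
print(round(k50, 4), round(k95, 2), low, high)
```

Such zones are meaningful only under the conditions already stated: normality and constant scatter about the regression line.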

It should be noted, however, that heteroscedasticity is not uncommon 
in systems of rectilinear regression. Under such circumstances, $s_{y.x}$ 
is only an average value for the changing scedasticity as one passes 
along the regression line. It will not define the standard error of 
estimate about any one point Y except by coincidence. After a little 
experience, visual inspection of the scatter diagram or correlation table 
will usually suffice in judging the reasonableness of the assumption of 
“constant scatter” for most practical purposes. 

The form of variation about the regression line follows the normal 
curve quite commonly, even when the surface itself is not otherwise of 
the normal form. It is only in such cases, of course, that the standard 
error of estimate may be multiplied by factors appropriate to the normal 
curve for the establishment of probability zones about the regression 
line. In Fig. 29 an illustration is presented of the establishment of such 
zones. Although the regression in this surface is rectilinear and homo- 
scedasticity prevails, the surface itself is anormal, for the univariate 
distributions are distinctly anormal. Nevertheless, although the corre- 
lation coefficient has doubtful descriptive value taken alone, it leads 





directly to the correct regression line and error of estimate. The coarsely 
broken lines on either side of the regression line define the zone of 
“equally probable error” in the prediction of diastatic activity from 
carbon dioxide production. The outer finely broken lines bound the 
zone within which 95 per cent of all cases may be expected to occur. 

FIGURE 29 

Residual Variation Zones About the Regression Line in a Homoscedastic 
System * 



RELATIVE RESIDUAL VARIATION 

It will be observed that the ratio of the error of estimate to the total 
variation in the dependent variable may be expressed as a function 
solely of the correlation coefficient. 


$$\frac{s_{y.x}}{s_y} = \sqrt{1 - r_{xy}^2}.$$



* Data from C. E. Davis and D. F. Worley. Cereal Chem., 11: 536-545. 1934. 


The value of this ratio is often now called the coefficient of alienation, 
for it measures the proportion of the variation in the dependent variable 
which is alienated from or independent of the associated variable. This 
value is the same for any normal surface regardless of which variable is 
considered as dependent. 

$$\text{C.A.} = \sqrt{1 - r^2}. \qquad (3)$$

The coefficient of alienation increases from zero to unity as the correla- 
tion coefficient decreases from unity to zero; that is, they vary mutually 
on the same scale but in opposite directions. The complement to unity 
of the coefficient of alienation therefore varies in the same sense as r 
and may be used as an index of the prediction value of r. Thus a 
“prediction index” may be established as 

$$\text{P.I.} = 1 - \text{C.A.} = 1 - \sqrt{1 - r^2}. \qquad (4)$$

It is of considerable value to study the relationship of the coefficient 
of alienation and of the prediction index to the coefficient of correlation. 
These relationships are portrayed graphically in Fig. 30, for C.A. and 
P.I. are continuous functions of r alone. The graph is a quadrant 
of a circle passing from (P.I. = 0, r = 0) to (P.I. = 1, r = 1). These 
limits form the only two points at which the correlation coefficient 
appropriately indicates prediction capacity on a correlation surface. 
Thorough appreciation of the nature of this graph is essential to correct 
understanding of the correlation coefficient as a measure of “intensity of 
association” between two variables. In our preceding discussion of the 
correlation coefficient we were content to accept demonstration that 
it passed in magnitude from zero to unity as the “intensity of correlation” 
moved from nothing at all to perfect rectilinear dependence. 
We are now in a position to interpret more fully the meaning of inter- 
mediate values. 

Let us study this matter further by a question-and-answer method. 

(1) Question: Does a correlation coefficient of 0.5 mean that half of the 
variation in one variable is accounted for in terms of the other variable? 

Answer: No! When r = 0.5, slightly more than 86 per cent of the variation 
in the dependent variable is still unaccounted for. 

(2) Question: At what value of r is the residual variation reduced to 50 per 
cent of the total? 

Answer: r = 0.866. 

(3) Question: How high must the correlation coefficient be before only 10 per 
cent or less of the variation remains unaccounted for? 

Answer: r = 0.995 or above. 

(4) Question: How high must r be before 10 per cent of the variation in one 
variable is accounted for in terms of the other variable? 

Answer: The correlation coefficient must be 0.436 or above. 
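
The four answers above follow directly from equations (3) and (4). As a minimal sketch in Python (the function names are hypothetical), the coefficient of alienation and the prediction index may be computed for any r, and equation (3) may be inverted to find the r that leaves a stated residual fraction.

```python
import math

def alienation(r):
    """Coefficient of alienation, equation (3)."""
    return math.sqrt(1 - r * r)

def prediction_index(r):
    """Prediction index, equation (4)."""
    return 1 - alienation(r)

def r_for_alienation(ca):
    """Correlation coefficient that leaves the residual fraction `ca`."""
    return math.sqrt(1 - ca * ca)

print(round(alienation(0.5), 3))          # (1) residual fraction at r = 0.5: 0.866
print(round(r_for_alienation(0.5), 3))    # (2) r leaving 50 per cent: 0.866
print(round(r_for_alienation(0.1), 3))    # (3) r leaving 10 per cent: 0.995
print(round(r_for_alienation(0.9), 3))    # (4) r accounting for 10 per cent: 0.436
```

Tabulating `prediction_index(r)` for r = 0.1, 0.2, and so on to 1.0 shows how slowly prediction value rises at low values of r and how steeply it rises near unity.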

FIGURE 30 

Functional Relationship of the Coefficient of Alienation and the 
Prediction Index to the Correlation Coefficient 

[Horizontal axis: correlation coefficient] 


It is sometimes taken as a basis for condemnation of the correlation 
coefficient that a difference of 0.1 (or any fixed increment) between two 
values of r means an increasing difference in intensity of association, 
from the prediction point of view, as one passes from low to high values 
of r. This deficiency is regrettable, but no wholly satisfactory way 
of avoiding it by using a function of r, such as $r^2$ or the prediction index 
just considered, has as yet found general appeal. It should be understood 
clearly that the correlation coefficient, as a measure of intensity 
of association, must be interpreted with care. Spurious interpretation 
has its roots in lack of understanding by the interpreter and does 
not originate in the statistic itself. 



CHAPTER 10 


ERRORS OF RANDOM SAMPLING 

The scientist is interested in the necessarily limited number of 
individual measurements with which he must deal only so far as they 
form a basis of generalizations concerning the much larger population or 
supply of such measures in a universe of discourse. In the analysis of 
data, the biometrician therefore faces tasks beyond those of statistically 
describing samples. He will be called upon to form from that informa- 
tion an estimate of what may be expected to be true of the supply from 
which the sample or samples have arisen. He will be asked to determine 
whether descriptive statistics such as means calculated from samples of 
different origins show differences of sufficient magnitude to warrant the 
inference that the populations which they represent are differentiated 
in like manner. The question arises immediately: To what extent is 
$\bar{x}$, or any other such statistic calculated from N individuals randomly 
drawn as a sample from a specified supply, trustworthy as a measure of 
the mean or other corresponding value of the supply? 

It is a matter of common experience that descriptive statistics do vary 
from one random sample to another, although that variation naturally 
cannot be observed if only a single sample is drawn. It must be appar- 
ent that a precise knowledge of the amount and kind of such variation 
which may in general be expected is essential to the determination of the 
trustworthiness of each sample statistic as a basis of inference concerning 
the supply. Variation in statistics which is due to errors of random 
sampling must surely be susceptible of description in terms of the same 
quantities as are employed for defining the distribution of variates. 
Orderliness of variation in the latter must surely suggest the same for 
the former. Is not the distribution of variates also the distribution of 
means for samples of but one individual each? 

If a sample is drawn from a set of measures, all of which are identical, 
it is immediately obvious that the sample will reproduce the supply 
precisely with respect to that measurement, regardless of the size of the 
sample drawn. A sample of one is sufficient to reproduce the invariant 
measurement. However, if another supply of measures which differ 
among themselves is used, it is equally obvious that a sample of one 
cannot reproduce the new supply. Indeed, if that supply (assumed to 
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be infinitely great in its total frequency) should embrace only 10 differ- 
ent magnitudes, one would feel it a remarkable coincidence if a sample 
of 10 individuals drawn at random reproduced all the supply measures. 
Even then the supply would not be reproduced perfectly by the sample 
unless in that supply all 10 measurements occurred with equal fre- 
quency. Thus the accuracy of determination of supply characteristics 
through analysis of a sample would seem to constitute a complex prob- 
lem, apparently dependent at least on the nature of the supply variation 
and the size of the sample drawn. 

THE RANDOM SAMPLE 

Reference has been made above to samples drawn at random from 
some supply of measures. The concept of the random sample is so 
important in statistics that repeated consideration of its nature will not 
be amiss. From the simplest games of chance to the pinnacles of 
statistical inference in science it plays a crucial role. Once a supply of 
measures characterized by variation is completely defined, it is not a 
difficult matter to devise methods of drawing a representative sample 
in the sense that it is a random one. The proportionate occurrence of 
each single magnitude in that supply defines the probability of selection 
of that magnitude in the truly random sample. Beads, discs, or suitable 
objects may be drawn as a sample from a receptacle in which a very 
large number of them have been thoroughly mixed or “randomized”; 
or numbers may be assigned to individuals in the supply and a set of 
such numbers chosen from a table of “random sampling numbers” (such 
as that of Tippett^) to define the sample. Such samples will differ in 
constitution solely through “errors of random sampling,” for the 
proportionate occurrence of each magnitude in the supply defines the 
probability of its being drawn in the sample. 
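
The numbering procedure may be imitated with a pseudo-random number generator in place of a printed table. The sketch below is only an analogy under stated assumptions: the supply is an invented list of measurements, the function name `draw_random_sample` is hypothetical, and the numbers are drawn without repetition, which the text does not specify.

```python
import random

def draw_random_sample(supply, n, seed=None):
    """Number the individuals 1..len(supply), pick n numbers at random, return their measures."""
    rng = random.Random(seed)
    numbers = rng.sample(range(1, len(supply) + 1), n)  # n distinct individual numbers
    return [supply[k - 1] for k in numbers]

# Invented supply of 50 measurements, for illustration only.
supply = [round(10 + 0.1 * i, 1) for i in range(50)]
sample = draw_random_sample(supply, 4, seed=3)
print(sample)
```

Because every number has the same chance of selection, each magnitude is drawn with a probability equal to its proportionate occurrence in the supply.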

On the other hand, when the frequency distribution of the variates 
in the supply itself is not known, the scientist faces a very different 
problem. A set of measurements arises through circumstances prescribing 
the material for and scope of an experiment or field of observation. 
The sample is readily specified, but the supply of which it is representative 
in the sense of being a random sample remains to be defined on 
logical grounds. The ease with which such logic may be corrupted by 
wishful thinking undoubtedly leads quite often to erroneous definition 
of the supply. Far too much generality is likely to be introduced into 
that definition, leading to conclusions which independent tests fail to 
substantiate. 

^ Tracts for computers, No. 15. Cambridge University Press. 
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STATISTICS AND PARAMETERS 

In order to proceed with the discussion of this most important problem 
it is desirable to differentiate clearly, both in name and in symbol, 
between statistics, which describe the characteristics of samples, and the 
corresponding magnitudes describing the supply. Let the latter be 
designated as parameters and be symbolized where convenient by Greek 
letters equivalent in some way to the English letters used for the 
statistics. Thus $\sigma$ (lower-case Greek s, called “sigma”) and s may both stand 
for the standard deviation, but the former will refer to the value in the 
supply whereas the latter will pertain to the sample. For a given supply, 
$\sigma$ is a fixed value, but s will vary from one sample to another when 
those samples are drawn at random from the supply. Likewise the correlation 
coefficient of a given normal bivariate supply may be designated 
by $\rho$ (Greek “rho,” lower case), the coefficient for samples being 
symbolized by r. Then in absence of knowledge of the parameters $\sigma$ and $\rho$, the 
statistics s and r will form bases of estimation of their respective values. 

Parametric designation for the mean cannot be conveniently given 
as a Greek equivalent to the “bar” above the symbol for the variable. 
One may use $\mu$ as an equivalent of m for mean. We shall, however, 
wish to use $\mu$ for moment coefficients of the supply. Now mean moments 
about zero as origin are often designated as $m'$ to distinguish them from 
the moment coefficients m, for which the mean is the origin. It will be 
seen that 

$$m'_1 = \frac{\Sigma x}{N}, \qquad m'_2 = \frac{\Sigma x^2}{N},$$

and so forth. Omitting the subscript 1 as unnecessary and placing there 
the symbol for the variable, $\mu'_x$ will form a logical parametric notation for 
the mean of the supply of a series of x values. 

From a given supply of variates x, indefinitely large in number, let 
samples of N individuals be drawn at random. Let it be assumed that 
a very large number, k, of such samples, all of the same total frequency 
N, has been drawn. Then k values of $\bar{x}$ will be available, and also k 
values of $s_x$. The true sampling error of each sample mean and standard 
deviation, so far as those statistics form estimates of the corresponding 
parameters, may be expressed as $\bar{x} - \mu'_x$ and $s_x - \sigma_x$ respectively. One 
might anticipate that the frequency distributions of the k values of $\bar{x}$ 
and $s_x$ would have as more or less central values the parameters $\mu'_x$ and 
$\sigma_x$ respectively. But what are the mean, standard deviation, and form 
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of the frequency distribution for each statistic? Answers to these ques- 
tions may be derived algebraically; but the complexity of the derivations 
forbids their inclusion here. Recourse will be made at this point to the 
results of an actual sampling experiment which attests quite well to 
the known theoretical relationships. 


THE SAMPLING DISTRIBUTION OF MEANS 

The supply of finger-length measurements which have already been 
used as an illustration of a normal distribution^ was prepared for sam- 
pling by assigning to each of the 3,000 measurements a number between 
1 and 3,000. These numbers were then sampled, using the table of 
Tippett (op. cit.), samples of 4 numbers each being drawn at a time. 
The measurements corresponding to these numbers then formed the 
samples, 786 of which were so drawn. In Fig. 31 the outer histogram 
gives the distribution of the supply as defined by the original records. 
The inner histogram portrays the distribution of the 786 sample means, 
grouped into classes of 1-mm. range like the original records. 

The curve superimposed on the distribution of means in Fig. 31 is 
the normal frequency distribution having the following parameters: 

(1) its mean coincides with the mean of the original measurements, 

$$\mu'_{\bar{x}} = \mu'_x; \qquad (1)$$

(2) its standard deviation is equal to the standard deviation of 
the supply divided by the square root of N, the size of sample drawn, 

$$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}}. \qquad (2)$$


The fit of the normal curve to the results of the actual sampling test is 
good. This would naturally be expected if equations (1) and (2) above 
present the proper relationships and the form of distribution is normal. 
The equations may be derived algebraically and are explicitly correct. 
The normal form for the distribution of means may also be shown to be 
correct whenever the distribution of variates comprising the supply is 
normal. 
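
A sampling experiment of the same design is easily repeated by machine. The sketch below is an imitation under stated assumptions, not a reproduction of the finger-length experiment: the supply of 3,000 values is generated artificially from a normal curve with an invented mean and standard deviation, and a pseudo-random generator stands in for the table of random sampling numbers.

```python
import math
import random
import statistics

rng = random.Random(1)

# Artificial supply of 3,000 measurements (invented mean 11.5, SD 0.55).
supply = [rng.gauss(11.5, 0.55) for _ in range(3000)]
mu = statistics.fmean(supply)
sigma = statistics.pstdev(supply)      # divisor N: the supply is treated as complete

N, k = 4, 786                          # sample size and number of samples, as in the text
means = [statistics.fmean(rng.sample(supply, N)) for _ in range(k)]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.pstdev(means)
expected_sd = sigma / math.sqrt(N)     # equation (2)

print(mu, mean_of_means)               # these should lie close together, equation (1)
print(expected_sd, sd_of_means)        # likewise, equation (2)
```

With only 786 samples the agreement is approximate; it tightens as the number of samples is increased.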

Although biological distributions in general follow forms of curves 
with the normal law as a central type, they very frequently show some 
more or less small degree of departure from normality. How, then, are 
means of samples drawn at random from such supplies distributed? 
Equations (1) and (2) above may be derived without any assumption 

^ Vide page 14. 
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whatever as to the form of distribution in the supply. They therefore 
hold in all random sampling situations. However, it is found that the 
skewness and kurtosis of the supply affect the same features of the 
distribution of sample means. Strictly speaking, the latter distribution 
has one Nth part of the skewness and kurtosis of the supply curve when 
these features are measured in terms of the “beta coefficients.” From 
the practical point of view in biology this pulling away of the distribu- 
tion of means from the normal form is in general so small as to be of 
trivial consequence. Only when dealing with samples of very small 
frequency which have been drawn from variables with markedly anormal 
distribution form need there be any concern about anormality in the 
sampling distribution of means. With such situations we shall not deal 
herein; they represent special cases of comparatively rare occurrence. 
Let it be understood then in that which follows that N is assumed to be 
adequately large to render essentially normal the form of distribution of 
means arising from repeated random samples. 
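
The “one Nth part” rule can be verified exactly for a small discrete supply by enumerating every possible sample. The sketch below rests on two assumptions that are not spelled out in the text: skewness is measured by $\beta_1 = \mu_3^2/\mu_2^3$, and kurtosis by the departure of $\beta_2 = \mu_4/\mu_2^2$ from its normal value of 3. The three-valued supply and all function names are invented for illustration.

```python
from fractions import Fraction
from itertools import product

def central_moments(dist):
    """dist maps value -> probability; returns the 2nd, 3rd and 4th central moments."""
    mean = sum(v * p for v, p in dist.items())
    return tuple(sum((v - mean) ** k * p for v, p in dist.items()) for k in (2, 3, 4))

def betas(dist):
    """Return (beta1, beta2) of the distribution."""
    m2, m3, m4 = central_moments(dist)
    return m3 ** 2 / m2 ** 3, m4 / m2 ** 2

def mean_distribution(dist, n):
    """Exact distribution of the mean of n independent draws from dist."""
    out = {}
    for combo in product(dist.items(), repeat=n):
        value = sum(v for v, _ in combo) / n
        prob = Fraction(1)
        for _, p in combo:
            prob *= p
        out[value] = out.get(value, Fraction(0)) + prob
    return out

# Invented, markedly skew supply: values 0, 1, 4 with probabilities 0.5, 0.3, 0.2.
supply = {Fraction(0): Fraction(1, 2), Fraction(1): Fraction(3, 10), Fraction(4): Fraction(1, 5)}
N = 4
b1, b2 = betas(supply)
b1_mean, b2_mean = betas(mean_distribution(supply, N))

print(b1_mean == b1 / N)                 # skewness reduced to one Nth part
print(b2_mean - 3 == (b2 - 3) / N)       # departure of beta2 from 3 likewise
```

Exact fractions are used so that the equalities hold without rounding error; the enumeration grows as the Nth power of the number of distinct values, so it is practical only for small cases.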

FIGURE 31 

Frequency Distributions of a Sampling Supply, and of the Means of 786 
Randomly Drawn Samples of 4 Variates Each 



Finger length and mean finger length in centimeters 


The sampling error of any one mean is defined by the deviation 
$\bar{x} - \mu'_x$. If on the average with repeated sampling this deviation reduces 
to zero, then the error of sampling in means may be said to be without 
bias. The mean of a random sample then becomes defined as a statistic 
without bias. It does not tend on the average to overestimate or under- 
estimate the mean of the supply from which the sample is drawn. Note 
that this statement does not imply that averages of random samples are 
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without sampling error. There remains the element of an unbiased and 
unpredictable error which follows the normal law having a standard 
deviation equal to one “root Nth” part of the standard deviation of the 
supply. When the supply parameters are known, then the sampling 
error distribution of means is known completely. If, however, the sup- 
ply parameters are unknown and only a random sample from that sup- 
ply is available as a basis of induction, then a very different situation 
faces the analyst. It is with this very practical problem that we shall 
now deal. 

CONFIDENCE INTERVALS 

One may accept the mean ± 3 standard deviations as a range within 
which lie practically all individuals in a normal distribution. For the 
sampling distribution of means, then, $\mu'_x \pm 3\sigma_{\bar{x}}$ specifies reasonable limits 
within which single values of $\bar{x}$ may be expected to lie, although one 
must remember that a very small proportion will actually fall outside 
even this range. If the mean of a supply is not known, but a single 
value of $\bar{x}$ given by a random sample from that supply is available, then 
all one may infer about $\mu'_x$ is that it may have any one in a continuous 
range of values which would yield the given value of $\bar{x}$ through errors of 
random sampling. Limiting those errors to $3\sigma_{\bar{x}}$ as a working range, if 
$\bar{x}$ falls at the tentative limit of positive error, then 

$$\mu'_x = \bar{x} - 3\sigma_{\bar{x}};$$

if the error is negative, on the other hand, and again at the provisional 
limit, 

$$\mu'_x = \bar{x} + 3\sigma_{\bar{x}}.$$

It is a reasonable induction that $\mu'_x$ lies somewhere in the range specified 
by $\bar{x} \pm 3\sigma_{\bar{x}}$. One feels confident to a degree closely approaching 
certainty that $\mu'_x$ does not lie outside this range. Since any such interval 
is defined by sampling probability, it is customary to specify the 
confidence interval by the probability of sampling on which it is founded. 
Since $\pm 3\sigma$ in a normal distribution defines a central probability value of 
99.73 per cent, the points $\bar{x} \pm 3\sigma_{\bar{x}}$ may be referred to as the 99.73 
per cent confidence limits. Intervals portraying other degrees of 
confidence may be established in like manner. The limits specified by 
$\bar{x} \pm 1.96\sigma_{\bar{x}}$ define the 95 per cent confidence interval. In the upper 
panel of Fig. 32 the two possible values of $\mu'_x$ defined by the 95 per cent 
confidence limits, and the relation of the observed $\bar{x}$ to them, are given 
geometrically. It will readily be seen that a “confidence distribution 
curve” may be established about $\bar{x}$ as mean with standard deviation 
equal to $\sigma_{\bar{x}}$. Such a curve is drawn in the lower panel of Fig. 32. It is 
of interest to note that the most likely value of $\mu'_x$ to yield the given $\bar{x}$ 
value is that of $\bar{x}$ itself. Contrasting with this is the equally interesting 
point that the confidence of $\mu'_x$ having that value is infinitesimal, as it 
must be for any single value without latitude. 

FIGURE 32 

Confidence Intervals with Respect to a Supply Mean 
Estimated from a Sample Mean x 




(A) The lower and upper values of $\mu'_x$ for which the probability of $\bar{x}$ being a 
random sampling deviation is 5 per cent. 

(B) The corresponding “curve of confidence” with respect to ranges within which 
$\mu'_x$ may lie. 

The reasoning leading to the establishment of confidence limits for a 
parameter, as above, is straightforward and involves no assumption. 
The idea is fundamentally very simple and the terminology acceptable. 
The student should be careful, however, not to drop into the pitfall of 
concluding that “the probability that the mean of the supply falls in 
the range $\bar{x} \pm 1.96\sigma_{\bar{x}}$ is 95 per cent.” That is quite wrong, for in all 
cases $\mu'_x$ has a fixed value. It is not a variable, and the probability 
attaching to any one value being correct is either one or zero according to 
whether that value is right or wrong. The probability used in specify- 
ing the confidence limits pertains to the confidence in the range embrac- 
ing the parameter, not to the parameter itself lying in that range. This 
is a fine point, but a most important one to clear reasoning. 
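
The distinction may be made concrete by repeated sampling from a supply whose mean is fixed and known. In the sketch below, which assumes an invented normal supply (mean 50, standard deviation 10) and a hypothetical function `coverage`, the supply mean never changes; it is the interval that moves from sample to sample, and about 95 per cent of the intervals so formed embrace the fixed mean.

```python
import math
import random

def coverage(mu, sigma, n, trials, k=1.96, seed=0):
    """Share of intervals (sample mean +/- k sigma/sqrt(n)) that embrace the fixed supply mean."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)          # sigma is taken as known here
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - k * se <= mu <= xbar + k * se:
            hits += 1
    return hits / trials

rate = coverage(mu=50.0, sigma=10.0, n=25, trials=4000)
print(rate)   # close to 0.95, subject to the sampling error of the experiment itself
```

For any single interval already computed, the supply mean either lies within it or does not; the 95 per cent describes the long-run behaviour of the procedure.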

In the preceding discussion the unknown character of $\mu'_x$ has been 
observed, but no mention has been made of the unknown character of 
$\sigma_x$. Equational definition requires that $\sigma_x$ be known before $\sigma_{\bar{x}}$ can be 
calculated precisely. If the mean of the supply is unknown, is it likely 
that the standard deviation of the supply would be known? Except in 
special cases the answer must be in the negative. How then may one 
proceed in practical situations to establish confidence limits with 
respect to the mean of a supply? It is at this point that some degree 
of approximation must inevitably be introduced if progress is to be made 
in providing an answer. The only basis of estimate of $\sigma_x$ provided by 
a unique sample^ is the variation present among the sample variates 
themselves. The measure of that variation is $s_x$, and one therefore 
faces the new problem of how good an estimate of $\sigma_x$ is provided by $s_x$. 
The answer to this problem must be secured from study of the errors of 
random sampling in standard deviations. 


THE ERRORS OF RANDOM SAMPLING IN 
STANDARD DEVIATIONS 


The sampling distribution of standard deviations has been fully 
established by mathematical derivation only for the situation where the 
variates in the supply are normally distributed. It is customary to 
accept the relationships so given as being near enough for application in 
practical work with all reasonably normal biological distributions. Cer- 
tainly dependable evidence to the contrary is conspicuously absent. 
The convention will therefore be followed herein. 

Figure 33 presents the distributions of $s_x$ for samples of size 5, 9, and 
25, randomly drawn from a normal supply of standard deviation $\sigma_x$. 
The means of the sampling distributions are indicated by the inscribed 
ordinates. Two features of these distributions immediately attract 
attention. They are: 

(1) the distributions are noticeably skew when N is small, but 
approach symmetrical form as N increases; 

(2) the mean values of $s_x$ are less than $\sigma_x$, but approach that 
value more closely as N increases. 


With samples having a frequency of about 30 or more, no serious error 
is likely to be introduced in biological work if it is assumed that the 
distribution of standard deviations follows the normal law defined by the 
following equations: 

$$\mu'_{s_x} = \sigma_x \qquad (3)$$

and 

$$\sigma_{s_x} = \frac{\sigma_x}{\sqrt{2N}}. \qquad (4)$$


^ The adjective “unique” as applied to a sample in this book refers to “onlyness”; 
no other source of information than is provided by the sample is implied. 
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With sufficiently large samples it is acceptable then to conclude that 
the random sampling distributions of both means and standard devia- 
tions follow the normal law. In both cases the corresponding para- 
meters form the central values, but the standard deviations of the 

FIGURE 33 

Random Sampling Distributions of Standard Deviations for Samples of Size 
N Drawn from a Normal Supply of Standard Deviation $\sigma$ 



sampling errors differ. It is interesting to observe from equations (2) 
and (4) above that the standard deviation is the more trustworthy 
statistic in the sense that its random sampling error is less than that of 
the mean. The ratio of the random sampling error in means to that 
in standard deviations is given on the whole by 

$$\frac{\sigma_{\bar{x}}}{\sigma_{s_x}} = \frac{\sigma_x}{\sqrt{N}} \div \frac{\sigma_x}{\sqrt{2N}} = \frac{\sqrt{2}}{1}.$$

That is, one may have more confidence, in general, that $s_x$ is a good 
estimate of $\sigma_x$ than that $\bar{x}$ is a good estimate of $\mu'_x$. 
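
These relationships may be examined by a machine sampling experiment. The sketch below assumes an artificial normal supply with mean 0 and standard deviation 1, samples of N = 30, and 4,000 repetitions; the helper `sd` uses the divisor N, as the text does. All names and settings are illustrative choices.

```python
import math
import random

def sd(values):
    """Standard deviation with divisor N."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

rng = random.Random(7)
sigma, N, k = 1.0, 30, 4000

means, sds = [], []
for _ in range(k):
    sample = [rng.gauss(0.0, sigma) for _ in range(N)]
    means.append(sum(sample) / N)
    sds.append(sd(sample))

mean_of_sds = sum(sds) / k      # falls a little below sigma (the negative bias)
sd_of_means = sd(means)         # compare with sigma / sqrt(N)
sd_of_sds = sd(sds)             # compare with sigma / sqrt(2N)
ratio = sd_of_means / sd_of_sds

print(mean_of_sds, sd_of_means, sd_of_sds, ratio, math.sqrt(2))
```

The ratio settles near the square root of 2 only for samples of fair size drawn from a normal supply; with very small N the skewness of the distribution of standard deviations disturbs the comparison.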

STANDARD ERRORS 

The question which led to consideration of the sampling error dis- 
tribution of standard deviations in the last section was one of finding 
from a unique sample a suitable estimate of the standard deviation of 
the supply from which that sample was drawn at random. Such an 
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estimate is needed in order that one may develop an acceptable measure 
of the sampling errors to which the mean itself is subject. If, in the 
equation 

$$\sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}},$$

one may substitute for $\sigma_x$ an appropriate estimate of it, then an estimate 
will be forthcoming for $\sigma_{\bar{x}}$. 

With samples of large frequency there is but little latitude, relatively 
speaking, within which $s_x$ may differ from $\sigma_x$. Under such circumstances 
it has become customary to consider $s_x$ an adequate estimate of $\sigma_x$ for 
substitution in the right-hand member of the equation above. The 
estimate so determined of $\sigma_{\bar{x}}$ is known as the standard error of the sample 
mean. 

$$\text{S.E.}_{\bar{x}} = \frac{s_x}{\sqrt{N}}. \quad (N \text{ large.}) \qquad (5)$$

The term “standard error” is one of wide application, and one may 
choose this occasion as suitable for emphasizing its character. It is a 
determinable estimate of the true but unknown standard deviation of 
errors of random sampling attaching to any statistic. It is formed 
by substituting an estimate of the needed parameter in the equation 
defining the true standard deviation of the statistic to which it relates. 
Thus, on the same grounds of reasoning, the standard error of the stand- 
ard deviation of a sample is given by 

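$$\text{S.E.}_{s_x} = \frac{s_x}{\sqrt{2N}}. \quad (N \text{ large.}) \qquad (6)$$

This may be compared with the true but usually indeterminate formula, 

$$\sigma_{s_x} = \frac{\sigma_x}{\sqrt{2N}}.$$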


These estimates are really inadequate with small samples. It is 
important to note again from Fig. 33 that $s_x$ does not, on the average, 
yield the supply standard deviation, $\sigma_x$, but a somewhat smaller value. 
From the point of view of estimating $\sigma_x$, one recognizes that $s_x$ is a 
negatively biased statistic. When N exceeds 30, say, this bias is entirely 
negligible. For smaller values of N, however, it seems desirable to 
correct for this bias when estimating $\sigma_x$. 

Reasons for this negative bias have been set forth in Chapter 4, where 
it was suggested that division of $\Sigma(x - \bar{x})^2$ by N - 1, rather than by N, 
represents a logical step in direct estimation of the supply variation. It 
is most interesting to find that this logic, recognized by the astronomers 
since the time of Gauss, is supported by an identical correction arising 
from quite different reasoning. During recent decades, Fisher has 
vigorously urged a method of maximum likelihood estimation for forming 
“unbiased” estimates of supply parameters from the study of finite 
samples. This method* leads to the formula, 

$$\hat{\sigma}_x = \sqrt{\frac{\Sigma(x - \bar{x})^2}{N - 1}},$$

as providing the “most likely” value of $\sigma_x$ to lead to the observed 
standard deviation for the sample. If this estimate of $\sigma_x$ is used in determining 
a standard error expression for the mean of a sample, one finds by simple 
algebra that 


$$\text{S.E.}_{\bar{x}} = \frac{s_x}{\sqrt{N - 1}}, \qquad (7)$$

regardless of the magnitude of N. Obviously, equation (7) approaches 
equation (5) rapidly as N increases from very small values. When N 
is 30 or more the difference between the two expressions is trivial. 

Proceeding then from a unique sample, one may estimate confidence 
limits with respect to the mean of the supply by substitution in the 
expression, $\bar{x} \pm k(\text{S.E.}_{\bar{x}})$, where k is the appropriate factor for the 
probability involved. The factor k may be secured readily from tables of the normal 
probability integral for any desired degree of confidence. It must be 
borne in mind that, just as this estimate of the confidence interval 
becomes more precise as N increases, so inversely it becomes less accept- 
able as N decreases. 
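
The procedure may be sketched in Python as follows. The function name `confidence_limits` and the eight sample values are invented for illustration; the factor k is taken from `statistics.NormalDist` in place of a printed table of the normal probability integral; and the standard error is formed with the divisor N - 1 under the root, so that the sample standard deviation itself is computed with the divisor N.

```python
import math
from statistics import NormalDist

def confidence_limits(sample, confidence=0.95):
    """Estimated confidence limits for the supply mean from a unique sample."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)   # sample SD, divisor N
    se = s / math.sqrt(n - 1)                                  # standard error of the mean
    k = NormalDist().inv_cdf(0.5 + confidence / 2)             # about 1.96 for 95 per cent
    return mean - k * se, mean + k * se

# Invented sample: mean 5, standard deviation 2, N = 8.
low, high = confidence_limits([2, 4, 4, 4, 5, 5, 7, 9])
print(low, high)
```

A sample of eight is used here only to keep the arithmetic visible; as the text cautions, limits so estimated become less acceptable as N decreases.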


RANDOM SAMPLING DISTRIBUTION OF DIFFERENCES 
BETWEEN MEANS 

Perhaps the most common test called for in statistical analysis is that 
of determining whether the difference between the means of two samples 
is of sufficient magnitude to justify the inference that they reflect a real 
difference in the means of the supplies from which those samples have 
been drawn. For example, in a series of over 2,000 births at a certain 

* A simple exposition of this method for the mathematical reader may be found 
in the following article: W. Edwards Deming and Raymond T. Birge. On the 
statistical theory of errors. Reviews of Modern Physics, 6: 119-161. 1934. 
Reprinted with additional notes by the U. S. Dep't Agr. Graduate School, 1938. 
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hospital it is found that the boys are on the average a little over j inch 
longer than the girls. Is this small difference ascribable to sampling 
errors, or are boys really somewhat longer on the average than girls at 
birth? It is common knowledge that in adult life males exceed females 
in stature, but the smallness of this difference at birth invites careful 
consideration in relation to sampling errors. This being but one of a 
multitude of similar problems, let us consider the general problem first. 

Let $N_x$, $\bar{x}$, and $s_x$ be the statistics descriptive of a first sample, and 
let $N_y$, $\bar{y}$, and $s_y$ be those of a comparable sample of different origin. Let 
the supply parameters be $\mu'_x$ and $\mu'_y$ respectively. Then from the 
preceding discussion of sampling errors we know that $\bar{x}$ is merely an 
individual value in a normal sampling distribution of means which has 
its central value at $\mu'_x$. The $\bar{y}$ value belongs in an analogous distribution 
centered at $\mu'_y$. Since $\bar{x}$ and $\bar{y}$ differ in magnitude, it is desired to 
ascertain from our knowledge of sampling errors in means whether it is a safe 
inference that $\mu'_x$ and $\mu'_y$ differ in magnitude. 

For purposes of discussion let us hypothecate that the means of the
two supplies are identical; that is,

$$\mu_x = \mu_y.$$

This hypothesis must, in fact, be either true or false. If it is true, the 
observed difference between x and y is attributable solely to errors of 
random sampling. As a test of the reasonableness of the hypothesis, 
one naturally asks: What is the probability that a difference as great
as $\bar{x} - \bar{y}$ would arise through errors of random sampling alone? A large
probability would surely justify ascribing the difference to sampling
effects. On the other hand, if that probability is quite small, then one
must decide whether adherence to the hypothesis of no difference between
the parameters is still warranted. It may, perhaps, be more logical to
reject the “null hypothesis” in favor of its alternative, namely that

$$\mu_x \neq \mu_y.$$

The statistical task is completed when the correct probability is secured. 
The interpreter must decide whether to accept or reject the hypothesis 
yielding the probability. The statistical investigation lays the foundation
for logical processes of interpretation; it initiates, but most certainly
should not be construed as terminating, a continuum in logical
analysis.

The situation with respect to the distribution of differences between
means arising solely through errors of random sampling is presented
geometrically in Fig. 34. Curve (A) portrays the normal distribution
of values of $\bar{x}$ based on samples of size $N_x$, random sampling having
been made from a supply of mean $\mu_x$. Curve (B) similarly applies to
means $\bar{y}$ based on samples of size $N_y$ drawn from a supply of mean $\mu_y$.
If, from each of these distributions (A) and (B), one of the means they
portray is drawn at random, and from the pair so drawn the difference
$d = \bar{x} - \bar{y}$ is calculated, then continued repetition of this procedure will
give a large series of differences of which the distribution parameters may
be derived. Curve (C) portrays the curve of distribution of d, a curve
of normal form for which the mean and standard deviation will now be
determined by simple algebra.


FIGURE 34

Distribution of Random Sampling Differences between Elements which
Are Normally Distributed

(A) Curve of the distribution of means, $\bar{x}$

(B) Curve of the distribution of means, $\bar{y}$

(C) Curve of the distribution of differences, $d = \bar{x} - \bar{y}$, with
$\mu_d = \mu_x - \mu_y$


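Let a very large number, k, of differences be assumed available.
Then

$$\bar{d} = \frac{\sum d}{k} = \frac{\sum (\bar{x} - \bar{y})}{k} = \frac{\sum \bar{x}}{k} - \frac{\sum \bar{y}}{k}. \tag{8}$$

In the limit of $k \to \infty$, obviously

$$\mu_d = \mu_{\bar{x}} - \mu_{\bar{y}} = \mu_x - \mu_y. \tag{9}$$
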
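Now

$$s_d^2 = \frac{\sum d^2}{k} - \left(\frac{\sum d}{k}\right)^2 = \frac{\sum (\bar{x} - \bar{y})^2}{k} - \left[\frac{\sum (\bar{x} - \bar{y})}{k}\right]^2$$

$$= \frac{\sum \bar{x}^2}{k} - \frac{2\sum \bar{x}\bar{y}}{k} + \frac{\sum \bar{y}^2}{k} - \left[\left(\frac{\sum \bar{x}}{k}\right)^2 - 2\,\frac{\sum \bar{x}}{k}\,\frac{\sum \bar{y}}{k} + \left(\frac{\sum \bar{y}}{k}\right)^2\right].$$

By rearrangement of the terms in this expression,

$$s_d^2 = \left[\frac{\sum \bar{x}^2}{k} - \left(\frac{\sum \bar{x}}{k}\right)^2\right] + \left[\frac{\sum \bar{y}^2}{k} - \left(\frac{\sum \bar{y}}{k}\right)^2\right] - 2\left[\frac{\sum \bar{x}\bar{y}}{k} - \frac{\sum \bar{x}}{k}\,\frac{\sum \bar{y}}{k}\right]$$

$$= (A) + (B) - 2(C).$$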

Expression (A) is the difference between the mean square and the square
of the mean of k values of $\bar{x}$. Therefore

$$(A) = s_{\bar{x}}^2.$$

Similarly,

$$(B) = s_{\bar{y}}^2.$$

Now (C) is the difference between the mean product of paired values and
the product of their means. By rearrangement of the formula for
the correlation coefficient between such paired values we find that

$$(C) = r_{\bar{x}\bar{y}}\, s_{\bar{x}} s_{\bar{y}}.$$

That is,

$$s_d^2 = s_{\bar{x}}^2 + s_{\bar{y}}^2 - 2 r_{\bar{x}\bar{y}}\, s_{\bar{x}} s_{\bar{y}}.$$

In the limit of $k \to \infty$, these statistics become parameters of the curve
of distribution of d. In that limit, however, correlation between the
paired values of $\bar{x}$ and $\bar{y}$ must vanish, for they are drawn independently
by definition. It follows then that the standard deviation of the curve of
differences is

$$\sigma_d = \sqrt{\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2}. \tag{10}$$



Derivations of more complex character show that the third-moment 
coefficient of the d distribution is zero, and its fourth-moment coefficient 
is precisely three times the squared second-moment coefficient. The 
curve of distribution of differences between means of independent 
samples therefore follows the normal law, with parameters given by 
equations (9) and (10) above. 
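
The result may be checked empirically. The following Python sketch, a simulation under assumed values rather than anything taken from the text, draws many pairs of sample means from two hypothetical normal supplies and compares the mean and standard deviation of their differences with equations (9) and (10). The supply parameters and sample sizes are invented purely for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical supplies and sample sizes (assumptions for illustration only).
MU_X, SIGMA_X, N_X = 52.0, 2.0, 25
MU_Y, SIGMA_Y, N_Y = 51.0, 3.0, 16


def sample_mean(mu, sigma, n):
    """Mean of one random sample of size n from a normal supply."""
    return statistics.fmean(random.gauss(mu, sigma) for _ in range(n))


# Repeat the drawing of paired means many times and record d each time.
K = 20000
diffs = [
    sample_mean(MU_X, SIGMA_X, N_X) - sample_mean(MU_Y, SIGMA_Y, N_Y)
    for _ in range(K)
]

observed_mean = statistics.fmean(diffs)
observed_sd = statistics.pstdev(diffs)
expected_mean = MU_X - MU_Y                                    # equation (9)
expected_sd = (SIGMA_X**2 / N_X + SIGMA_Y**2 / N_Y) ** 0.5     # equation (10)

print(f"mean of d: observed {observed_mean:.3f}, expected {expected_mean:.3f}")
print(f"s.d. of d: observed {observed_sd:.3f}, expected {expected_sd:.3f}")
```

With a large number of repetitions the observed values should settle close to the expected ones, the agreement improving as K is increased.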
The foregoing derivations have been made without any specification
about identity in the means of the supplies. If it is now taken as a
special case that they are identical, that is $\mu_x = \mu_y$, the distribution of
differences between means of samples randomly drawn from those supplies
will be a normal curve about zero as mean, with a standard deviation
given by equation (10). In finding the probability that an observed
difference between two means x and y is due solely to errors of random 
sampling, it is therefore not necessary to know what the common mean 
of the supplies is. Such differences are distributed normally about zero. 

The probability of any given difference (or a larger one) between the 
means of two samples drawn at random from supplies of identical mean 
would be derivable precisely from tables of the normal curve functions 
only if the supply standard deviations were known. The relative devi- 
ate of the difference in the normal distribution defining the errors of 
random sampling would be given by 


$$k = \frac{d - \mu_d}{\sigma_d} = \frac{\bar{x} - \bar{y}}{\sqrt{\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2}}. \tag{11}$$


At this point one must usually turn to estimation. Dependent on
$\sigma_x$ and $\sigma_y$ for their values, $\sigma_{\bar{x}}$ and $\sigma_{\bar{y}}$ are not likely to be precisely
determinable except in very special cases. In place of them one may choose
to use the standard errors as estimates, yielding the derived equation

$$\mathrm{S.E.}_d = \sqrt{\mathrm{S.E.}_{\bar{x}}^2 + \mathrm{S.E.}_{\bar{y}}^2}. \tag{12}$$


This expression for the standard error of the difference between two 
means is an estimate of the true but unknown standard deviation of the 
errors of sampling determining such differences. Substituting $\mathrm{S.E.}_d$ in
equation (12) for $\sigma_d$ in equation (11), one has the approximation

$$k \;(=)\; \frac{\bar{x} - \bar{y}}{\sqrt{\mathrm{S.E.}_{\bar{x}}^2 + \mathrm{S.E.}_{\bar{y}}^2}}, \tag{13}$$

which, of course, improves as the sample frequencies $N_x$ and $N_y$ increase.
From k the corresponding probability may be found from the table in 
Appendix I. 

PRACTICAL APPLICATION 

The computational steps involved in testing the null hypothesis
applied to the means or standard deviations of large samples are of very
simple character. Analysis of the problem immediately following will
exemplify this. Data secured from the archives of the Sloane Hospital,
New York City, for length of new-born infants of Irish parents yielded
the following statistics:

Sex           Number    Mean         St. Dev.

Male (x)      1136      51.96 cm.    2.181 cm.

Female (y)    1071      51.22 cm.    2.189 cm.

These data may be analyzed to determine whether the sex differentiation 
with respect to mean value and variability shown by the samples justi- 
fies the inference that Irish male offspring are in general longer and 
less variable in length than females at birth. Since the standard devia- 
tions of the supplies are unknown, one must depend on standard errors, 
which for such large samples will provide excellent estimates. 


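1. The difference in means

$$d = (\bar{x} - \bar{y}) = 0.74 \text{ cm.}$$

$$\mathrm{S.E.}_{\bar{x}}^2 = \frac{2.181^2}{1136} = 0.004187.$$

$$\mathrm{S.E.}_{\bar{y}}^2 = \frac{2.189^2}{1071} = 0.004474.$$

$$\mathrm{S.E.}_d = \sqrt{0.004187 + 0.004474} = 0.093.$$

$$k \;(=)\; \frac{d}{\mathrm{S.E.}_d} = \frac{0.74}{0.093} = 7.96.$$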


This value of k is beyond the range of the table in Appendix I. The
probability of such a difference being due solely to errors of sampling is
in fact less than one in a million. There can be no doubt then that the
observed difference between the sample means has only a most remote
chance of arising solely from errors of random sampling. A real difference
of like sign in the supply means is therefore indicated. Using the
95 per cent confidence limits, one would estimate that, while 0.74 cm. is
the most likely difference between the supply means, the difference may
well be as low as

$$0.74 - 1.96\,\mathrm{S.E.}_d = 0.56 \text{ cm.},$$

or as high as

$$0.74 + 1.96\,\mathrm{S.E.}_d = 0.92 \text{ cm.}$$


The smallness of the difference (most probably less than 1 cm.) makes 
this problem of academic rather than practical interest, of course. 
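
The arithmetic of this first test can be retraced in a few lines of Python. The sketch below is only an illustration: the function name `difference_of_means` is an assumption, and the probability is taken from the normal curve by way of the complementary error function rather than from Appendix I. Because the standard error is not rounded to 0.093 before division, k comes out near 7.95 rather than 7.96.

```python
from math import erfc, sqrt
from statistics import NormalDist


def difference_of_means(n_x, mean_x, s_x, n_y, mean_y, s_y, confidence=0.95):
    """Large-sample test of a difference between two independent means.

    Returns d, S.E. of d, k, the two-sided normal probability of a
    difference at least as large, and the confidence limits for d.
    """
    d = mean_x - mean_y
    se_d = sqrt(s_x**2 / n_x + s_y**2 / n_y)      # equation (12)
    k = d / se_d                                  # equation (13)
    p = erfc(abs(k) / sqrt(2.0))                  # two-sided normal tail area
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return d, se_d, k, p, (d - z * se_d, d + z * se_d)


# Length at birth, Sloane Hospital data quoted above.
d, se_d, k, p, (low, high) = difference_of_means(
    1136, 51.96, 2.181, 1071, 51.22, 2.189
)
print(f"d = {d:.2f} cm, S.E. = {se_d:.3f}, k = {k:.2f}, P = {p:.1e}")
print(f"95 per cent limits: {low:.2f} to {high:.2f} cm")
```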
2. The difference in variability

With N as large as it is in each sample, the errors of sampling in the
standard deviations must follow the normal law very closely. Therefore
the difference between the standard deviations may be analyzed
precisely as in the case of the means.

$$d = (s_y - s_x) = 0.008 \text{ cm.}$$

$$\mathrm{S.E.}_{s_y}^2 = \frac{s_y^2}{2N_y} = 0.002237, \qquad \mathrm{S.E.}_{s_x}^2 = \frac{s_x^2}{2N_x} = 0.002093.$$

$$\mathrm{S.E.}_d = \sqrt{0.002237 + 0.002093} = 0.066.$$

$$k \;(=)\; \frac{0.008}{0.066} = 0.12.$$


From Appendix I, one finds that the probability of a larger difference 
than this arising solely through sampling errors is greater than 90 per 
cent. It would therefore seem most foolish to ascribe any significance 
to this particular difference. Indeed, wide experience with such com- 
parisons of variability in length at birth is convincing that boys are 
somewhat more variable than girls. In the preceding example it appears 
that a sampling error has probably more than offset a real difference of 
opposite sign. 
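
The comparison of variabilities may be sketched in the same way. The Python lines below assume the large-sample standard error of a standard deviation, s divided by the square root of 2N, which reproduces the squared standard errors 0.002237 and 0.002093 used above. The function name `difference_of_sds` is an illustrative assumption.

```python
from math import erfc, sqrt


def difference_of_sds(n_x, s_x, n_y, s_y):
    """Large-sample test of a difference between two standard deviations.

    Assumes S.E. of a standard deviation is s / sqrt(2N).
    Returns d, S.E. of d, k, and the two-sided normal probability.
    """
    d = s_y - s_x
    se_d = sqrt(s_x**2 / (2 * n_x) + s_y**2 / (2 * n_y))
    k = d / se_d
    p = erfc(abs(k) / sqrt(2.0))   # chance of a larger difference either way
    return d, se_d, k, p


# Standard deviations of length at birth for the two sexes.
d_s, se_s, k_s, p_s = difference_of_sds(1136, 2.181, 1071, 2.189)
print(f"d = {d_s:.3f} cm, S.E. = {se_s:.3f}, k = {k_s:.2f}, P = {p_s:.2f}")
```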

A second example may be given in terms of samples of rather small 
size. Two chemists made 20 determinations each of the protein content
of a sample of flour sent to them for analysis. The determinations,
with tests of the differentiation in mean and standard deviation of each 
set, are given in Table 20. Again there appears rather conclusive evi- 
dence that the difference in the mean values is not due solely to errors of 
sampling. Analyst x has recorded almost 0.1 per cent more protein 
than analyst y. Evidence that one is any more consistent in his analyses 
than the other is lacking; the standard deviations certainly do not differ 
significantly. 

TABLE 20

Replicate Determinations of Protein Content of the Same Flour Sample
by Two Analysts, with Statistical Tests for Significance of
Differentiation in Mean and Standard Deviation

    x         y         Tests of significance

   9.95      9.94       $\bar{x}$ = 10.0495.
  10.01      9.91       $s_x$ = 0.0397.
  10.04     10.01
  10.01      9.94
  10.09      9.96       $\bar{y}$ = 9.9525.
  10.06      9.98       $s_y$ = 0.0455.
  10.11      9.96
  10.05     10.03
  10.07      9.94       $\bar{x} - \bar{y}$ = 0.0970.
  10.08      9.91       $\mathrm{S.E.}_{\bar{x}-\bar{y}}$ = 0.0135.
  10.03      9.99       k (=) 7.19.
   9.97      9.99       $P_k < 10^{-\ldots}$
  10.07      9.93
  10.05      9.93
  10.04      9.86       $|s_x - s_y|$ = 0.0058.
  10.06      9.86       S.E. = 0.0095.
  10.06     10.03       k (=) 0.61.
  10.07      9.96       $P_k$ > 0.5.
  10.11      9.96
  10.06      9.96

One may well feel somewhat insecure about the inference that the
difference in means is significant, because standard errors derived from
rather small samples are likely to be poor substitutes for the true standard
deviations of random errors. The point may be tested further by
aiming to set up a theoretical value for that standard deviation as follows.
If $\sigma$ is the true error of analysis for these two workers, how large
would it need to be to cast doubt on the significance of the difference
between the means? An answer to this question will require, as a prerequisite,
selection of some definite “level of significance.” For demonstration
purposes let us choose the 5 per cent level. Then the observed
difference in means, 0.0970, must be 1.96 times $\sigma_d$ to be on this level,
making $\sigma_d$ equal to 0.05 approximately. Now


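$$\sigma_d^2 = \frac{\sigma_x^2}{N_x} + \frac{\sigma_y^2}{N_y},$$

and if $\sigma_x$ equals $\sigma_y$, then with N equal to 20 in each case, it follows that

$$0.05^2 = \frac{\sigma^2}{20} + \frac{\sigma^2}{20} = \frac{\sigma^2}{10},$$

or

$$\sigma^2 = 0.025,$$

and

$$\sigma = 0.16 \text{ approx.},$$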
where $\sigma$ is the random error of analysis common to both chemists. One
may now ask: Do the observed values of $s_x$ and $s_y$ appear consistent
with this value of $\sigma$? If $\sigma$ equals 0.16, then, for samples of 20, $\sigma_s$ would
equal 0.025, and the values of $s_x$ and $s_y$ would most probably exceed
0.16 − 2(0.025) = 0.11. But $s_x$ and $s_y$ are both far less than this value.
Therefore it seems that the analysts differ quite significantly in the mean
values they report. What then is the true protein content of the flour 
sample? It may be well at this point to leave the reader to his own 
reflections. 
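
The whole of this second example can be retraced from the twenty determinations of each analyst. The Python sketch below is offered only as an illustration; it assumes standard deviations computed with divisor N (which agrees with the values 0.0397 and 0.0455 in Table 20) and the standard error s divided by the square root of 2N for a standard deviation. The variable names are assumptions of the sketch.

```python
from math import sqrt
from statistics import fmean, pstdev

# Protein determinations of the two analysts, as listed in Table 20.
x = [9.95, 10.01, 10.04, 10.01, 10.09, 10.06, 10.11, 10.05, 10.07, 10.08,
     10.03, 9.97, 10.07, 10.05, 10.04, 10.06, 10.06, 10.07, 10.11, 10.06]
y = [9.94, 9.91, 10.01, 9.94, 9.96, 9.98, 9.96, 10.03, 9.94, 9.91,
     9.99, 9.99, 9.93, 9.93, 9.86, 9.86, 10.03, 9.96, 9.96, 9.96]

n = len(x)
mean_x, mean_y = fmean(x), fmean(y)
s_x, s_y = pstdev(x), pstdev(y)          # divisor N, as in the table

# Test of the difference between means.
d = mean_x - mean_y
se_d = sqrt(s_x**2 / n + s_y**2 / n)
k_mean = d / se_d

# Test of the difference between standard deviations.
se_s = sqrt(s_x**2 / (2 * n) + s_y**2 / (2 * n))
k_sd = abs(s_x - s_y) / se_s

# How large would the common error sigma need to be for d to sit
# exactly on the 5 per cent level?  sigma_d^2 = 2 * sigma^2 / n.
sigma_d_needed = d / 1.96
sigma_needed = sqrt(sigma_d_needed**2 * n / 2)
# Two standard errors below sigma: s_x and s_y would probably exceed this.
lower_plausible_s = sigma_needed - 2 * sigma_needed / sqrt(2 * n)

print(f"means {mean_x:.4f}, {mean_y:.4f}; s.d. {s_x:.4f}, {s_y:.4f}")
print(f"k for means = {k_mean:.2f}; k for s.d. = {k_sd:.2f}")
print(f"sigma needed = {sigma_needed:.2f}; lower bound for s = {lower_plausible_s:.2f}")
```

Both observed standard deviations fall well below the lower bound so obtained, which is the ground for retaining the verdict of significance.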

The data presented as Table 5a in Chapter 2 provide opportunity for 
the reader to apply tests of significant differentiation in means and in 
standard deviations. The samples of brain weights for the two sexes 
fulfill the condition of being independent of each other, as is required by 
the above reasoning. This condition is not fulfilled by the data on 
statures of husbands and wives given as Fig. 18, wherein paired values 
arise and significant correlation has been demonstrated. We will now 
consider the problem of testing the significance of the mean difference 
between correlated pairs of variates. 


PAIRED VARIATES 

Every test of significance of the difference between the means of two 
samples is motivated, to some extent, by a desire to expose or refute the 
influence of some factor, or constellation of factors, which may be deemed 
to have exerted an influence of a “causative” nature. An investigation 
of the influence of such factors will be well designed when all factors other 
than those under consideration are precluded from influencing the 
results of the research. One of the best ways to achieve this end is to 
pair the variates in two series for all influential factors other than those 
under scrutiny. The influence of a single nutritive factor on growth of 
experimental animals, for instance, may be investigated through ade- 
quate design in this manner. Experimental rats might be paired for 
genetic constitution, age, and weight at the beginning of the experiment, 
one member going to diet group A and the other to diet group B. Assuming
that during the experiment the rats are subjected to the same environmental
conditions, and ingest equivalent quantities of the two diets, any
significant difference in their mean growth responses may be ascribed 
to the dietary factor. Any such difference is, at least, not influenced 
by the factors for which initial pairing was made. 

A direct approach in testing the significance of any difference in
response of the groups is to secure the mean difference between the
pairs, and ascertain the probability that it might arise by chance. The
equation,

$$k \;(=)\; \frac{\bar{d}}{\mathrm{S.E.}_{\bar{d}}}, \tag{14}$$

may readily be solved for k when it is recognized that

$$\mathrm{S.E.}_{\bar{d}} = \frac{s_d}{\sqrt{N}},$$

where N is the number of differences, or pairs of rats.

It may, however, be desired to correlate the pairs for some purpose. 
In that event the individual differences between pairs need not be 
secured. If x and y designate the variates for the two groups, then 


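$$\bar{d} = \bar{x} - \bar{y},$$

and

$$s_d^2 = s_x^2 + s_y^2 - 2 r_{xy} s_x s_y. \tag{15}$$

The reader may easily verify these equations for himself by following
the pattern given in the closely analogous derivation of equations (8)
and (10) earlier in this chapter. It follows directly, also, that

$$\mathrm{S.E.}_{\bar{d}} = \sqrt{\mathrm{S.E.}_{\bar{x}}^2 + \mathrm{S.E.}_{\bar{y}}^2 - 2 r_{xy}\,\mathrm{S.E.}_{\bar{x}}\,\mathrm{S.E.}_{\bar{y}}}. \tag{16}$$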


The importance of the factor $r_{xy}$ in equations (15) and (16) may
readily be appreciated from consideration of the following problem, if
not on purely theoretical grounds.

In 1925 the Minnesota State Testing Mill studied 56 carlots of wheat. An official
dockage assessment on each carlot was made by the State Grain Inspection Department,
and another test was made independently at the mill laboratory. The basic
data need not be given here,* as we need only certain derived statistics.

x = official test of dockage, in per cent
y = mill test of dockage, in per cent
N = 56

$\bar{x}$ = 3.39        $\bar{y}$ = 3.94

$s_x$ = 2.29        $s_y$ = 2.26

$r_{xy}$ = +0.97

When the correlation is overlooked, equation (12) being used then in error, one finds
k (=) 1.27, indicating an insignificant difference in the two series of dockage tests.
However, the correlation must properly be recognized by using equation (16). Then
one finds that k (=) 7.03, indicating a highly significant discrepancy between the
dockage evaluations of the two laboratories.

* A fuller discussion may be found in J. Am. Soc. Agron., 23: 558-571. 1931.
This is but one of a multitude of examples which might be given to
illustrate how truth may be lost in a significance test involving paired
determinations when the correlation between the variables is overlooked.
The reader will readily discern that if a comparison of the two
laboratories’ determinations had been made on approximately the same
numbers of different carlots, both series being random samples from
arriving shipments, the difference in means would probably have been
insignificant. Tremendously greater sensitivity of the investigation is 
achieved in this case by confining the analyses to the same series of 
materials, or using the principle of paired observations. 
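
The effect of the correlation term on the dockage comparison can be seen by computing k both ways from the derived statistics. The Python sketch below is an illustration under the assumption that each standard error is s divided by the square root of N; worked from the rounded statistics printed above it gives roughly 1.28 and 7.4, close to, though not identical with, the values reported in the text, which were presumably obtained from unrounded figures.

```python
from math import sqrt

# Derived statistics for the 56 carlots of wheat.
N = 56
mean_x, mean_y = 3.39, 3.94      # official test, mill test (per cent)
s_x, s_y = 2.29, 2.26
r_xy = 0.97

se_x = s_x / sqrt(N)             # assumed form of the standard error
se_y = s_y / sqrt(N)
d = mean_y - mean_x

# Equation (12): correlation overlooked (inappropriate for paired data).
se_independent = sqrt(se_x**2 + se_y**2)
k_independent = d / se_independent

# Equation (16): correlation between the paired tests recognized.
se_paired = sqrt(se_x**2 + se_y**2 - 2 * r_xy * se_x * se_y)
k_paired = d / se_paired

print(f"ignoring correlation: S.E. = {se_independent:.3f}, k = {k_independent:.2f}")
print(f"using correlation:    S.E. = {se_paired:.3f}, k = {k_paired:.2f}")
```

The standard error of the mean difference shrinks to less than one-fifth of its former size once the high correlation is taken into account, which is the source of the greater sensitivity of the paired design.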

SIGNIFICANT DIFFERENTIATION 

The terms “significant difference” and “insignificant difference” are 
frequently used with such finality of judgment in statistical writing 
that it is not surprising to find many students of science who aspire to 
command the means of such conclusive proof. They naturally essay 
to learn how to make statistical tests. To do so without appreciating 
fully the reasons why the test is appropriate is to court trouble. This is 
particularly true if tests of significance are undertaken when only the 
simple steps of computation are known. It is particularly desirable, 
then, that careful consideration be given to the implication of the term 
“significant difference,” which arises directly from interpretation of the 
probability that the difference between two statistical values being com- 
pared would arise through errors of random sampling alone. 

The foregoing data on sex differentiation in length at birth showed 
that, on the average for the samples considered, boys exceeded girls in 
length by about ¾ cm., or slightly over ¼ inch. In the light of purely
practical considerations this must surely seem an unimportant difference. 
And yet one finds by statistical analysis that the observed difference or a 
larger one would not arise through errors of sampling alone more than 
once in a million times. Does this first pair of samples happen to be one 
of those exceedingly rare ones, or is it more reasonable to conclude that 
boys are longer on the average than girls at birth? It would seem most 
foolish to cling to the null hypothesis in view of the evidence, and so the 
interpreter naturally rejects it in favor of the alternative conclusion that 
there is a real differentiation of the sexes with respect to length at birth.

The difference provided by the samples is taken to signify a difference 
of like character in the supplies of male and female infants from which 
the samples were drawn, and is called a significant difference. When the 
probability of the difference arising through sampling errors alone is 
large, as in the comparison of the standard deviations for length at birth,
then significance is not ascribed to the difference and the designation
“insignificant” is used. Thus the verdict hinges entirely on the magnitude
of the probability yielded by application of the null hypothesis.

It would appear, then, that at some point on the probability scale
there is a dividing line segregating the “significant” from the “insignificant”
probability. Such a conclusion, however, is merely a half
truth. Two other most important factors must be considered. The
transition from one judgment to the opposite one as one advances along
the probability scale may logically take place gradually over a range of
the scale instead of sharply at a point. Also, the consequences of rejecting
one hypothesis in favor of the other must be considered in location
of the transition zone (or zone of doubt), be it wide or narrow.


ERRONEOUS INFERENCE 

The parameters being considered in the null hypothesis must be, in 
fact, either identical or different. From study of the available data 
given by the samples the statistician wishes to infer the correct statement 
from these two alternatives. That is, his inference when made will be 
either right or wrong. If he should draw the wrong inference, he must 
naturally face responsibility for the practical consequences of his mis- 
take. Such erroneous inferences must fall into one of two classes : 

(A) insignificance is claimed when a real difference in the para- 
meters exists, or 

(B) significance is claimed where none exists. 

Errors of these two classes have entirely different consequences. 

Erroneous inferences falling under class (A) above will cause the
rejection (complete or tentative) of the data as being insufficient to war- 
rant the claim that there is a real difference between the parameters 
being considered. This action essentially demands that more evidence 
is required before significance may properly be inferred. That evidence 
may, of course, be sought through more refined or extensive investiga- 
tion if the problem is important enough to warrant continuance of the 
research. 

Errors of class (B), on the other hand, ordinarily terminate further
research in verification. Indeed, acceptance of a verdict of significance 
may well initiate a developmental program of new investigations. Such 
a program will be based on a wrong premise, of course, if the initial con- 
clusion of significance was faulty. Therefore errors of class (B), assign- 
ing significance where none really exists, are likely to be the more serious 
by far in their immediate effect on the progress of science. The procedure
of inference from a probability derived under the null hypothesis
should, then, be characterized by defenses against making an error of 
class (B) above. 

The probability of differences between sample statistics being due to 
errors of random sampling may, of course, take any value from unity 
down to zero. Between the values of 1 and 0.5, the probability is so 
high as to cause acceptance of the null hypothesis without question, in 
the absence, of course, of other relevant evidence. As the probability 
falls below 0.5, however, the evidence may be taken to favor increasingly 
the rejection of the null hypothesis. As a defensive measure against 
making an error of class (B) above, it is customary among statisticians 
to require that the probability be less than 0.05 before significance is 
claimed, that is, before a real difference in parameters is inferred and the 
null hypothesis is rejected. Any such level of probability beyond which 
significance is claimed is known as a level of significance. 

Let it be assumed now that a very large number of tests of signifi- 
cance are to be made by a statistician who chooses to depend on a 5 per 
cent level of significance. Let it further be assumed that, unknown to 
him, the differences are all actually due to random sampling errors and 
are unselected for magnitude; the pairs of samples may be assumed 
deliberately drawn from the same supply, but this is really immaterial. 
In applying the test in each case arising from the null hypothesis, he will 
find that 5 per cent of the differences actually exceed the chosen “level
of significance.” He would normally take these to signify real differences
in the parameters of reference, if his suspicion to the contrary was
not aroused. But actually these 5 per cent of “significant differences”
are by definition only random sampling differences from supplies with
identical parameters. That is, under the specified conditions the inferences
will be wrong 5 times in 100 when the 5 per cent level of significance
is used, and the percentage error will be given always by the level of
significance used.
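
The situation just described lends itself to a simple sampling experiment. In the Python sketch below, every pair of samples is deliberately drawn from one and the same normal supply, so that each claim of significance is necessarily an error of class (B). The sample size of 50, the number of trials, and the function names are assumptions chosen only for illustration.

```python
import random
from math import sqrt

random.seed(7)  # fixed seed so the sketch is repeatable


def mean_and_variance(values):
    """Mean and variance (divisor N) of a list of numbers."""
    n = len(values)
    m = sum(values) / n
    return m, sum((v - m) ** 2 for v in values) / n


def claims_significance(n=50, k_critical=1.96):
    """Draw two samples of size n from the SAME normal supply and report
    whether their means differ 'significantly' at the 5 per cent level."""
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    mean_x, var_x = mean_and_variance(x)
    mean_y, var_y = mean_and_variance(y)
    se_d = sqrt(var_x / n + var_y / n)
    return abs(mean_x - mean_y) / se_d > k_critical


TRIALS = 4000
false_claims = sum(claims_significance() for _ in range(TRIALS))
error_rate = false_claims / TRIALS
print(f"class (B) errors: {false_claims} in {TRIALS} tests ({error_rate:.1%})")
```

The proportion of false claims should lie in the neighbourhood of 5 per cent, differing from it only through the chance fluctuations of the experiment and the use of standard errors in place of the true standard deviations.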

It is most important to recognize in the situation just discussed that 
all differences were postulated as being due solely to errors of random 
sampling. Only errors of class (B) above were possible. Such a condi- 
tion is not likely to prevail over any substantial length of time in statis- 
tical experience. Let us now reconsider the problem modified to par- 
allel practical situations. Let us mix with the above paired samples, 
differing solely by sampling errors, a number of pairs of samples for 
which the parameters are really different. In testing the null hypothe- 
sis, both types of error may now be made, but let us focus attention on 
the more serious ones comprising class (B) only. The proportion of 
wrong inferences claiming significance when none exists will now be 
diluted to something below 5 per cent because many of the claims of
significance will now be right, and these correct ones must form more
than 5 per cent of all really different paired samples. Thus a 5 per cent
level of significance adopted in practical statistical work will lead to a 
maximum error of 1 in 20 in claiming significance. 

If this proportion of class (B) error is too high, then it is a simple
matter to shift the critical value for inferring significance to some lower
level, such as 1 per cent, or even less. Such a shift must, of course, lead
to a corresponding increase in errors of class (A). The price of avoidance
of errors of one class in establishing a level of significance is inevitably
one of committing more errors of the other class.
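
The price paid may be given a rough numerical form. The Python sketch below assumes, purely for illustration, that the difference between two statistics is normally distributed with a known standard error and that the real difference between the parameters amounts to 2.5 standard errors; the function name `class_a_error` and that figure of 2.5 are assumptions, not values from the text.

```python
from statistics import NormalDist

Z = NormalDist()  # unit normal curve


def class_a_error(true_diff_in_se: float, level: float) -> float:
    """Probability of failing to claim significance (a class (A) error)
    in a two-sided test at the given level, when the real difference
    between the parameters equals true_diff_in_se standard errors."""
    k_critical = Z.inv_cdf(1.0 - level / 2.0)
    return Z.cdf(k_critical - true_diff_in_se) - Z.cdf(-k_critical - true_diff_in_se)


# Lowering the level of significance reduces class (B) errors but
# raises class (A) errors for the same real difference.
for level in (0.05, 0.01, 0.001):
    print(f"level {level:>5}: class (A) error = {class_a_error(2.5, level):.3f}")
```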

The problem of choosing an acceptable level of significance is to 
find an appropriate balance between two types of error of entirely differ- 
ent character. This balance must logically be subject to change from 
one situation to another. The fundamental problem is of the relative 
importance of the two types of error to the particular investigation 
involved. Unless the relative importance is constant through all problems,
it is obviously not reasonable to establish a critical value of probability
for all situations as bases of statements of significance. It might
be an acceptable risk to endanger the life of 1 experimental rat in 20,
but the same attitude would hardly be sustained if human life is involved.
It is on the grounds of such reasoning that the writer of this discussion 
feels that judgments of significance should not be reached purely on the 
relation of a determined probability to some fixed critical value. 

The research worker should recognize clearly the nature of this 
probability. It does not measure the probability that the observed 
difference between the statistics did arise solely through sampling 
errors. It merely defines the probability of a difference as great as the
one observed arising if there is no difference between the supply para- 
meters. On the basis of this evidence he is to decide whether the null 
hypothesis should, for purposes of scientific inference, be rejected. His 
decision should properly be guided by the practical importance of the 
consequences engendered if his conclusion should prove to be the wrong 
one. More convincing evidence, such as is given by a lower probability, 
may well be desired in some cases than in others. 



CHAPTER 11 


SAMPLING ERRORS OF THE CORRELATION COEFFICIENT 

The coefficient of correlation was introduced by Galton and developed 
by Pearson as a statistic descriptive in a relative manner of the intensity 
of association between variables adhering to the normal law of bivariate 
frequency distribution. It is widely used in biological investigations 
today as providing satisfactory fulfillment of this descriptive function. 
In other cases, proper attention to the fact that its usefulness in this 
connection is restricted to normal surfaces is often lacking. The exis- 
tence of rectilinear regressions alone is not adequate for the logical com- 
parison of correlation coefficients. Lack of regard for this point (and 
others) has led to many erroneous interpretations, criticisms appropriate 
to the interpreter then being directed against the statistic. Undoubtedly a 
large proportion of biological investigations stop at correlation coefficients 
when they should properly be carried forward to regression equations
and studies of residual variations, for definition of which the correlation 
coefficient may serve as an excellent generating function. There remain, 
nevertheless, many situations wherein direct comparison of correlation 
coefficients is fully appropriate to solution of the problems under analysis. 

Two important questions arise in consideration of the magnitudes of
correlation coefficients calculated for normal surfaces as measures of 
intensity of association. They are: 

(1) How large should the value of the correlation coefficient be, 
when based on N paired observations, to justify the conclusion that 
correlation really exists in the supply? 

(2) When is the difference between two correlation coefficients 
sufficiently large to warrant assumption that the supplies from which 
the samples were drawn really differ in their intensities of association? 

Both these questions flow immediately from recognition of the fact that 
the correlation coefficient, like other statistics, must be expected to be 
influenced in its magnitude by errors of random sampling. 

Let the Greek symbol ρ designate the correlation coefficient for an
infinitely large supply of paired values following the normal law. From
this supply let samples of size N be considered drawn at random. The
correlation coefficients r of these samples may be anticipated to be distributed
about ρ as a more or less central value. Of what form is this
sampling distribution of r, and how may one determine from it the proba- 
bilities appropriate as bases for answering the two types of questions 
above? 

THE SAMPLING DISTRIBUTION OF r WHEN ρ EQUALS ZERO

Having no other information about correlation in the supply than is 
given by the sample itself, one naturally faces the possibility that ρ is zero
and that the magnitude of r, however large it be, might really be for- 
tuitous. What is the sampling distribution of r when there is no correla- 
tion in the supply yielding the samples? If, from a normal supply of 
paired but uncorrelated measures, samples of a given size are considered 
drawn at random, one would not anticipate that the correlation coeffi- 
cients calculated for these samples would all equal zero, nor indeed that 
any one would equal zero exactly, except by chance. However, a cluster- 
ing about zero might logically be expected for the values of r; a frequency 
distribution of symmetrical form with zero as its mean. One finds it 
unreasonable to construe any factor consistent with random sampling 
that would give other than a symmetrical distribution of r around the 
parametric value of zero. Whatever chance factors arise to yield samples 
with a certain positive value of r must surely be expected to lead in the 
long run to an equal number of samples with correlation coefficients of 
the same numerical magnitude but of opposite- sign. The normal bivari- 
ate frequency distribution with p equal to zero is radially symmetrical 
about its center. Also, the tendency for all but the smallest samples at 
least must be to reproduce the supply characteristic of zero correlation. 
A symmetrical l-shaped curve seems inescapable under such conditions. 
However, a strictly normal sampling distribution of r cannot be realized 
technically in this situation because a normal curve about zero as center 
must be of infinite range, whereas the scale of r is confined to the range 
between −1 and +1. What precisely is the form of this sampling
distribution of r when ρ is equal to zero?

It is idle to attempt mathematical derivation of the sampling
distribution of r with no more than simple algebra as a tool. One may more
profitably turn here to geometric representation of the theoretical
solution of the problem, first reached on partly intuitive grounds by Student
in 1908.¹ In Fig. 35 the precise sampling distribution curves of r when
ρ equals zero are drawn in full line for three values of N. Two features
will be noted immediately from this figure:

(1) for small samples, r may frequently take quite high values
solely through errors of sampling;

(2) as N increases, the probability of r exceeding any fixed value
falls rapidly.

¹ Student. Probable error of a correlation coefficient. Biometrika, 6: 302-310.
1908.

FIGURE 35

Random Sampling Distributions of r, Based on Samples of
Size N, when ρ Equals Zero (horizontal axis: scale of r)

These distribution curves are actually so close to the normal form that 
the difference is essentially negligible except for rather small values of N.
The broken-line curve in Fig. 35 is the normal distribution having the 
same mean and standard deviation as the true curve for N equal to 10. 
Even there the discrepancy between the two curves is not very great. 
For N equal to 20 and above, the lack of concordance between the true 
curve and the corresponding normal curve is so small as to be of no prac- 
tical consequence. It may be noted that the kurtosis of the true curve is 
negative. It is “flatter topped” than the normal curve of like mean and 
standard deviation. While the range for all the true curves is
theoretically −1 to +1, the frequency beyond three standard deviations may for
practical purposes be ignored, at least when N equals or exceeds 20. 

The normal curve may well be accepted to portray with adequate
approximation the sampling distribution of correlation coefficients when
ρ is equal to zero and N is 20 or above. In order to define this closely
approximating normal curve it is necessary, of course, to know the
standard deviation of the true distribution. Student derived the definitive
equation as

$$\sigma_r = \frac{1}{\sqrt{N-1}} \qquad (1)$$


Biologists confining attention to large samples frequently ignore the −1
in the denominator, but one may just as well choose to retain it to avoid 
changing the formula when N does become rather small. 
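Equation (1) may be checked by a small sampling experiment. The sketch below is illustrative only: the sample size of 20, the 4,000 repetitions, the seed, and the helper names `pearson_r` and `sd_of_r_under_null` are assumptions made for the example and do not come from the text.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sd_of_r_under_null(n, repeats, seed=0):
    """Standard deviation of r over many samples of size n drawn from an
    uncorrelated normal supply (assumed simulation settings)."""
    rng = random.Random(seed)
    rs = []
    for _ in range(repeats):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rs.append(pearson_r(xs, ys))
    return statistics.pstdev(rs)


n = 20
observed = sd_of_r_under_null(n, repeats=4000)
expected = 1.0 / math.sqrt(n - 1)  # equation (1)
print(f"simulated SD of r: {observed:.3f}; 1/sqrt(N-1): {expected:.3f}")
```

The two printed figures should agree to about two decimal places; closer agreement would require more repetitions.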

One may proceed on the basis of this information to solve the very
practical problem of determining whether any given value of r, calculated
for a sample of size N equal to 20 or above, may be accepted as signifying
correlation in the supply from which the sample is drawn. The procedure
is very simple. The null hypothesis is that ρ actually is zero. One must
then determine the probability that the given r is a deviation from zero
due to errors of sampling. Using equation (1) above, the probability of
any value of r or a greater one arising through errors of random sampling
will be given with adequate accuracy for most practical purposes by
calculating the relative deviate k_r of r in the normal distribution of zero
mean, and then referring to tables of the normal probability integral.
Now

$$k_r = \frac{r}{\sigma_r} = \frac{r}{1/\sqrt{N-1}} = r\sqrt{N-1}.$$


Remembering that when k equals or exceeds 2 the probability of the
deviate being exceeded is 0.045 or less, one may perhaps choose to formulate
the guiding rule that, when $r\sqrt{N-1}$ does not exceed 2, the correlation
coefficient of the sample does not signify real correlation in the supply.
This rule, of course, adheres closely to acceptance of the 5 per cent level 
of significance as the demarkation point of significant deviations. 

When the probability derived under the null hypothesis proves to be
sufficiently small, the assumption that ρ is equal to zero may logically be
rejected in favor of its alternative, namely, that correlation does exist in
the supply. What the value of ρ is precisely in the latter event, one
cannot say. One may choose to accept the observed r as an estimate of the
value of ρ, knowing, of course, that rejection of the null hypothesis is
tantamount to acceptance with confidence that ρ is of the same sign as r.

The example of the normal bivariate surface used as the first
illustration in Chapter 7 yields a correlation coefficient of +0.28 for assortative
mating for stature in 1,079 English couples. Since $\sqrt{N-1}$ equals 32.8,
k_r equals 9.2. There can be no doubt of the statistical significance of this
correlation. Had the same correlation coefficient (r = +0.28) been
found for a series of but 50 couples, would it still be significant? This
time $\sqrt{N-1}$ equals 7 and k_r becomes 1.96, a value precisely on the 5
per cent level of significance. In the absence of any other information,
one might be hesitant in claiming significance in such a case. Once in 20
times with samples of 50, r would exceed 0.28 in numerical magnitude
solely through random sampling errors. In such a situation more data
might be sought before a decision was reached.
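The arithmetic of this test is easily mechanized. The following is a minimal sketch, assuming the normal approximation of equation (1); the function names `relative_deviate` and `two_sided_normal_p` are hypothetical, and the complementary error function stands in for the printed tables of the normal probability integral.

```python
import math


def relative_deviate(r, n):
    """k_r = r * sqrt(N - 1): deviate of r under the hypothesis of zero correlation."""
    return r * math.sqrt(n - 1)


def two_sided_normal_p(k):
    """Probability that a normal deviate exceeds |k| in numerical magnitude."""
    return math.erfc(abs(k) / math.sqrt(2.0))


# The two cases discussed above: r = +0.28 with 1,079 couples and with 50 couples.
for n in (1079, 50):
    k = relative_deviate(0.28, n)
    print(f"N = {n}: k_r = {k:.2f}, P = {two_sided_normal_p(k):.3g}")
```

For the series of 50 couples the deviate is 1.96 and the probability is very nearly 0.05, in agreement with the text.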

It should be recognized clearly that a correlation coefficient designated
as statistically insignificant does not prove lack of association in the
supply. It merely indicates, in the absence of other evidence to the 
contrary, that the assumption of absence of correlation in the supply is 
sufficiently consistent with the observed value of r to warrant its serious 
consideration as an appropriate hypothesis. 


THE SAMPLING DISTRIBUTION OF r WHEN ρ IS NOT ZERO


It has been shown in the preceding section that the errors of sampling
to which r is subject when ρ is actually zero are very great when the
size of sample approaches small magnitudes. With 12 or fewer cases per
sample, r is found to exceed the value 0.5 in approximately 10 per cent or
more of cases, solely through errors of random sampling. By way of
contrast one may turn to the question of the sampling error in r when ρ
takes the limiting value of unity. Under this condition all points in the
supply distribution fall on a straight line and therefore every sample must
reflect the same situation. The sample coefficient will always be unity
when ρ is unity, errors of sampling having vanished entirely, regardless of
the size of the sample.

The sampling distribution of the correlation coefficient reduces to
relatively simple situations when ρ takes its limiting values of zero and
unity. On the basis of this knowledge of limiting conditions one may be
tempted to anticipate that the sampling distribution of r might follow the
simple transition of a reasonably normal curve, collapsing in standard
deviation from $1/\sqrt{N-1}$ to zero, as ρ is moved from zero to the limiting
value of unity. While such expectation is verified with respect to the
magnitude of the standard deviation of errors of sampling, it is far from
true with respect to the form of the distribution.

The nature of the sampling distribution of r for values of ρ other than
zero was not established precisely until 1915, when Fisher² derived the
equation without approximation. On the basis of calculations by Soper
et al.,³ facilitated by this work, the sampling distribution curves of r for
three values of ρ and two sizes of sample are given in Fig. 36. The upper
panel relates to samples of but 12 individuals each, the lower panel giving
the corresponding curves when N is 50, ρ being taken as zero, 0.4, and
0.8 in both panels.

FIGURE 36 

Random Sampling Distributions of r for Specified Values of ρ and N



The purpose in presenting Fig. 36 is to give some indication of the
opposing influences exerted on the form of the sampling distribution of r
by increasing the magnitudes of ρ and N. Starting from a symmetrical
and fairly normal curve when supply correlation is absent, the form of
distribution becomes increasingly skew as ρ is increased but N is held
constant. For any given value of ρ, on the other hand, as N is increased
the sampling distribution of r has its skewness diminished and approaches
the normal form. It is possible then to establish on a (ρ, N) coordinate
surface an area within which the normal curve may serve fairly well as a
representation of the sampling distribution of the correlation coefficient.
The shaded area in Fig. 37 demarks such a zone within which the normal
curve might be used with adequate approximation, its mean and standard
deviation being given by

$$\text{mean of } r = \rho \quad\text{and}\quad \sigma_r = \frac{1-\rho^2}{\sqrt{N-1}} \qquad (2)$$

² R. A. Fisher. Frequency distribution of the values of the correlation coefficient
in samples from an indefinitely large population. Biometrika, 10: 507-521. 1915.

³ H. E. Soper, A. W. Young, B. M. Cave, A. Lee, and Karl Pearson. On the
distribution of the correlation coefficient in small samples. Biometrika, 11: 328-413.
1917.


FIGURE 37

Demarkation of an Empirically Determined Zone (Shaded) on the Surface
of (ρ, N), within Which the Sampling Distribution of r May Be Considered
Practically Normal (horizontal axis: scale of N)


The chief interest of Fig. 37 lies not in its demarkation of a zone of
(ρ, N) within which a normal curve approximation might be satisfactory
for the sampling distribution of r, but rather in its indication of the wide
range of values of ρ and N within which normal curve approximations
will not be adequate. In the memoir just cited, Fisher (1915) drew
attention to the feature of rapidly intensifying skewness of the sampling
distribution of r as ρ approaches its upper limit of unity, indicating the
unsuitability of skeleton tables of the probability integral of the true
distribution in that region. In 1921 he⁴ removed all these difficulties by
an ingenious transformation of scale which converted all the sampling
distributions of r into reasonably normal curves. The transformation⁵ is

$$z_r = \tfrac{1}{2}\left[\log_e (1 + r) - \log_e (1 - r)\right] = 1.1513 \log_{10} \frac{1 + r}{1 - r} \qquad (3)$$

⁴ R. A. Fisher. On the "probable error" of a coefficient of correlation deduced
from a small sample. Metron, 1 (No. 4): 1-32. 1921.


The principle of rectilinear transformation of scale has been stressed in 
these pages as providing a method of coding leading to the utmost attain- 
able simplicity in calculation. In like manner Fisher’s curvilinear trans- 
formation of the r scale serves to reduce a continuous array of rapidly 
changing curve forms to an almost invariant form, approaching the 
normal law so very closely that the latter provides an excellent
approximation for all but very small samples.

This contribution of Fisher is notable for its reduction of a very
complex problem to one of great simplicity. Probabilities corresponding
to values of z_r may be obtained with considerable precision from tables
of the normal curve. These probabilities, however, must be identical
with those pertaining to the corresponding values of r itself. The
frequency of occurrence of r is not altered by transformation; the scale only
is transformed. If ζ is the transformed value of ρ following equation (3),
then Fisher has shown the mean and standard deviation of the z_r curve
to be

$$\text{mean of } z_r = \zeta \quad\text{and}\quad \sigma_z = \frac{1}{\sqrt{N-3}} \qquad (4)$$


The z_r curves secured by transformation of the six different sampling
distributions of r in Fig. 36 are given in Fig. 38. The constancy of these
curves in form and standard deviation for each value of N will be noted
immediately; the contrast with the parent curves in Fig. 36 is striking.
The probability that any given deviation of r from ρ, or of one value of
r from another, would arise through errors of random sampling may be
determined by transformation to z_r, proceeding with the analysis as
already detailed for normally distributed statistics.

⁵ It may be of interest to the mathematical reader to note that the dimensional
ratio of the major to the minor axis for the normal surface ellipse, plotted in terms
of relative deviates, is $\sqrt{(1+r)/(1-r)}$. This ratio is, of course, very intimately
related to intensity of association. The transformation suggested by Fisher is simply
one of replacing r by the natural logarithm of this ratio. It is readily seen from Fig. 22
that, as ρ passes from zero to unity, this ratio changes from unity to “infinity.”
Therefore the logarithm of the ratio varies over the range from “minus infinity” to
“plus infinity.”

FIGURE 38

z_r Transformations of the Random Sampling Distributions of r Given in
Figure 36 (horizontal axis: scale of z_r)
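The contrast between Figs. 36 and 38 may be imitated by a sampling experiment. The sketch below is only an illustration under assumed settings (ρ of 0.8, N of 12, 4,000 samples, an arbitrary seed); the helper names `sample_r` and `skewness` are hypothetical. It draws samples from a normal supply of known correlation, then compares the skewness of r with that of z_r and the standard deviation of z_r with equation (4).

```python
import math
import random
import statistics


def sample_r(n, rho, rng):
    """Correlation coefficient of one sample of n pairs from a normal supply
    whose correlation is rho."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # y = rho*x + sqrt(1 - rho^2)*e has unit variance and correlation rho with x
    ys = [rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0) for x in xs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def skewness(values):
    """Moment coefficient of skewness: third moment over cubed standard deviation."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)


rng = random.Random(1)  # arbitrary seed
n, rho = 12, 0.8
rs = [sample_r(n, rho, rng) for _ in range(4000)]
zs = [math.atanh(r) for r in rs]  # equation (3)

print(f"skewness of r:   {skewness(rs):+.2f}")
print(f"skewness of z_r: {skewness(zs):+.2f}")
print(f"SD of z_r: {statistics.pstdev(zs):.3f}; 1/sqrt(N-3): {1 / math.sqrt(n - 3):.3f}")
```

The distribution of r should show pronounced negative skewness, while that of z_r should be nearly symmetrical with a standard deviation close to the value given by equation (4).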


The tables of Appendix II to this volume have been prepared to
expedite the initial transformation of r into z_r. Simple arithmetic
interpolation between the tabled values of z_r will prove adequate for
determination of the values not actually given. The application of the
transformations may perhaps be dealt with more advantageously now through
consideration of specific numerical cases.

Problem 1. What are the values of z_r corresponding to correlation coefficients of
+0.6125 and +0.9015? 

From Appendix II, Table A, we have

      r         z_r      Increment in z_r
    0.61      0.7089         0.0161
    0.62      0.7250

When r = +0.6125, then

    z_r = +0.7089 + 0.25(0.0161) = +0.7129.

From Appendix II, Table B, we have

      r         z_r      Increment in z_r
    0.901     1.4775         0.0053
    0.902     1.4828

When r = +0.9015, then

    z_r = +1.4775 + 0.5(0.0053) = +1.4802.
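The interpolation of Problem 1 may be compared with direct evaluation of equation (3). In the sketch below the "tabled" values are regenerated from the formula and rounded to four places, which is an assumption standing in for Appendix II; the function names are hypothetical.

```python
import math


def fisher_z(r):
    """Equation (3): z_r = 0.5 * [ln(1 + r) - ln(1 - r)]."""
    return 0.5 * (math.log(1.0 + r) - math.log(1.0 - r))


def fisher_z_log10(r):
    """The common-logarithm form of equation (3)."""
    return 1.1513 * math.log10((1.0 + r) / (1.0 - r))


def interpolate_z(r, r_low, r_high):
    """Linear interpolation between two four-place tabled values, as in Problem 1."""
    z_low, z_high = round(fisher_z(r_low), 4), round(fisher_z(r_high), 4)
    fraction = (r - r_low) / (r_high - r_low)
    return z_low + fraction * (z_high - z_low)


for r, low, high in ((0.6125, 0.61, 0.62), (0.9015, 0.901, 0.902)):
    print(f"r = {r}: interpolated {interpolate_z(r, low, high):.4f}, "
          f"direct {fisher_z(r):.4f}, log10 form {fisher_z_log10(r):.4f}")
```

Interpolated and direct values should agree to about four decimal places, which bears out the remark that simple arithmetic interpolation is adequate.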

Problem 2. Is a correlation coefficient of +0.6125, based on only 12 pairs of
measurements from a normal supply, significantly less than the value of +0.9015
found for a comparable series of 19 pairs of measurements?


    Coefficient             z_r        N     σ_z = 1/√(N − 3)
    For r_1 = +0.9015     +1.4802     19          0.25
    For r_2 = +0.6125     +0.7129     12          0.33
    Difference             0.7673


Since the standard deviation of z_r is independent of ρ, the above values involve no
approximation. They are not standard errors but true standard deviations of normal
sampling-error curves, of which the means are not known because ρ is in each case
unknown. The problem is to test whether both samples may have come from
supplies having the same value of ρ, the analysis logically reducing in many problems
to testing whether both samples may have come from the same supply. This is a
matter again of examining the null hypothesis. If ρ_1 is the same as ρ_2, then the
two sampling distributions of z_r above have the same mean value. It follows directly
that the mean of the distribution of differences arising solely through errors of random
sampling will be zero. Also, the random differences curve will be normal in form
with standard deviation,

$$\sigma_d = \sqrt{\sigma_{z_1}^2 + \sigma_{z_2}^2} = \sqrt{\tfrac{1}{16} + \tfrac{1}{9}} = 0.4167.$$

The situation is precisely analogous to that in testing the significance of the difference
between normally distributed means. Securing the relative deviate of the observed
difference in this distribution,

$$k = \frac{0.7673}{0.4167} = 1.84,$$

and from Appendix I,

$$P = 0.066.$$

Thus the observed difference between the two sample correlation coefficients would
arise 6 or 7 times in 100 solely through errors of random sampling. May one
appropriately conclude in this case that there is insufficient evidence to warrant the
assumption of a significant difference between the two correlation coefficients?
Since P is less than 0.5, the evidence favors differentiation, but the margin of doubt
is not negligible by any means. Not knowing to what biological relationships these
coefficients pertain, one is unable to consider the issue of the practical importance of
inferring real differentiation erroneously.

Small samples have intentionally been chosen for the above problem
for two reasons. First, the technique itself is probably accurate enough
for practical purposes even with N as small as 10. Second, the problem
serves to indicate the relatively large differences in correlation coefficients
which may arise through errors of random selection with samples of
small size, even in the upper reaches of the scale where increments in r
mean more than for lower values.
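The whole of Problem 2 may be reproduced in a few lines. This is a sketch under the assumptions of the text (normal supplies, σ_z taken from equation (4), a two-sided normal probability); the function name `compare_correlations` is hypothetical, and z_r is computed directly rather than read from Appendix II.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Difference of the two z_r values, its standard deviation, the relative
    deviate k, and the two-sided normal probability P."""
    diff = math.atanh(r1) - math.atanh(r2)           # atanh(r) equals equation (3)
    sd_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    k = diff / sd_diff
    p = math.erfc(abs(k) / math.sqrt(2.0))
    return diff, sd_diff, k, p


diff, sd_diff, k, p = compare_correlations(0.9015, 19, 0.6125, 12)
print(f"difference = {diff:.4f}, sd = {sd_diff:.4f}, k = {k:.2f}, P = {p:.3f}")
```

The figures so obtained should match those of the problem: a difference near 0.7673, a standard deviation of 0.4167, k near 1.84, and P near 0.066.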


THE SIGNIFICANCE TEST FOR r WHEN N IS SMALL 

The rather high accuracy of the foregoing transformation technique 
provides a reliable method for testing whether a correlation coefficient 
calculated from a small sample signifies real correlation in the supply 
from which it is drawn. Warning has been given that, when ρ is zero
and N is quite small, the sampling distribution of r although symmetrical 
is sufficiently platykurtic to make the use of a normal curve inadvisable. 
Transformation to z_r yields a curve more nearly normal in form which
may well be used in preference. 

In 1921, however, Fisher⁶ provided another transformation in terms
of which the appropriate probability may be secured without any
approximation whatever. This transformation is as follows:

$$t = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}} \qquad (5)$$


Detailed study of the symmetrical leptokurtic sampling distribution of t
must be reserved for another occasion. Its probability integral has been
rather fully tabled by Student,⁷ who formulated the distribution for
other purposes in 1908. Present interest in it lies solely in its provision
of the precise sampling probability of r when ρ is zero and N is small. For
this purpose, in lieu of giving Student's tables in full, Fig. 39 has been
prepared. The lines of this chart trace the 10, 5, 2, and 1 per cent
sampling probability values in r for values of N from 4 to 20. Locating the
point (r, N) on this chart corresponding to any given sample value, one is
able to determine whether the probability of the given value of r arising
from an uncorrelated supply falls short of or exceeds 10, 5, 2, or 1 per
cent. These percentages cover the critical zone in most tests of
significance of a correlation coefficient.

⁶ Vide supra.

⁷ Student. New tables for testing the significance of observations. Metron,
5 (No. 3): 18-21 and 26-32. 1925.

FIGURE 39

Probability Zones That the Deviation of r from Zero May Be a Random
Sampling Error

By way of illustration one may test the significance of the correlation
coefficient +0.6125 based on 12 observations.

(1) Reference to Fig. 39 shows that the probability of this r being
exceeded by random sampling errors alone is between 5 and 2 per
cent. The precise value determined from “Student's” table is 0.0344.

(2) Using Fisher's logarithmic transformation, the relative deviate
of z_r is $z_r\sqrt{N-3}$. Thus one finds k = 2.1387. From Appendix
I, P = 0.0325.

(3) The coarser approximation of assuming the r distribution to
be near enough to normal gives $k_r = r\sqrt{N-1} = 2.0314$, and
P = 0.0422.

Thus the z_r technique gives very nearly the precise value. The $r\sqrt{N-1}$
technique yields somewhat too high a value of P, an error on the
conservative side so far as claiming significance is concerned. As N increases,
the discrepancy between the results given by the three methods decreases,
becoming essentially negligible beyond N equal to 20.
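The three methods may be set side by side in code. The sketch below is an illustration, not a reproduction of Student's table: the probability integral of t is obtained by numerical integration of the density with Simpson's rule, an assumed substitute for the printed tables, and all function names are hypothetical.

```python
import math


def t_from_r(r, n):
    """Equation (5): t = r * sqrt(N - 2) / sqrt(1 - r^2), with N - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def two_sided_t_p(t, df, steps=2000):
    """P(|T| > |t|) for Student's distribution, by Simpson's rule applied to the
    density over the interval from 0 to |t| (steps must be even)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(x):
        return c * (1.0 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 1.0 - 2.0 * (total * h / 3.0)


def two_sided_normal_p(k):
    """Probability that a normal deviate exceeds |k| in numerical magnitude."""
    return math.erfc(abs(k) / math.sqrt(2.0))


r, n = 0.6125, 12
p_exact = two_sided_t_p(t_from_r(r, n), n - 2)                 # method (1)
p_z = two_sided_normal_p(math.atanh(r) * math.sqrt(n - 3))      # method (2)
p_r = two_sided_normal_p(r * math.sqrt(n - 1))                  # method (3)
print(f"t: {p_exact:.4f}   z_r: {p_z:.4f}   r*sqrt(N-1): {p_r:.4f}")
```

The first probability should fall near 0.034, close to the value quoted from Student's table, with the z_r result slightly below it and the $r\sqrt{N-1}$ result somewhat above it.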

In bringing to a close this brief discussion of sampling errors in the
correlation coefficient, it seems pertinent to remind the reader that the
correlation coefficient taken alone does not fully define any normal
surface. Associations between the same variables having the same
correlation coefficient in samples of different origins may differ widely in
their regression equations and residual variations. Likewise, surfaces
may have precisely the same regression line and residual variation
about that line in predicting one variable from the other, yet have very
different correlation coefficients. Figure 40 is presented to illustrate the
latter possibility. In the three surfaces represented therein by contours,
the regressions of y on x are identical, and the residual variation about
the regression line is the same throughout. The values of the correlation
coefficient are, however, 0.3, 0.6, and 0.9. The surfaces were readily
constructed through selection of suitable values of σ_x and σ_y
yielding the desired likenesses and contrasts.

FIGURE 40

Representation of 3 Normal Surfaces Having Identical Regression Lines
and Residual Variations in Predicting y from x, but Widely Different
Correlation Coefficients

The investigator depending on the correlation coefficient as a
descriptive statistic for comparative work should not fail to secure thorough
appreciation of such points as the above.

PROPORTIONS AND PROBABILITY

Experience flowing from oft-repeated experiments has led to general 
establishment of the belief that, when an ordinary coin like those of the 
currencies of the English-speaking peoples is suitably tossed¹ and allowed
to come to rest, the likelihood of one particular face of the coin being 
uppermost is not appreciably different from that of the other face being 
uppermost. Using such experience as a basis for forecasting the outcome 
of future experiments, one commonly states that the probability of the 
one event (e.g., head face uppermost) is equal to the probability of the 
other event (tail face uppermost). A third possible event, namely, that 
of the coin finally resting on its edge, may be recognized. However, any 
experiment yielding such a result may be rejected empirically as having 
no bearing with respect to the issue of the relative probabilities of “heads”
and “tails.” 

This very familiar experiment is quite useful as a basis for the
establishment (a) of certain definitions of wide scope, and (b) of consequences
of a general nature flowing from those definitions. First of all there arises
the concept of probability, the likelihood of an event occurring in the
future based on experience of occurrence of that event and its
alternatives in the past. Probability is belief founded on evidence. Probability
may be given approximate numerical value by counting the events of
finite experience and deriving the ratio of the count N_x of the specified
event, x, to the count N_T of all the events wherein the specified one might
have occurred. In the notation just given, N_T obviously equals the sum
of N_x and the count of the events y which are alternative to the event
x; N_T is the total number of events wherein x might have occurred. The
degree of approximation of this ratio to the true probability becomes
closer, of course, as N_T increases. The true probability is precisely
defined by infinite experience only.
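The approach of the ratio N_x / N_T toward the true probability as N_T increases may be watched in a simulated experiment. The following sketch assumes an ideal coin with probability 0.5 of heads, imitated by a pseudo-random number generator with an arbitrary seed; the function name `estimated_probability` is hypothetical.

```python
import random


def estimated_probability(event_count, total_count):
    """Ratio N_x / N_T of the count of the specified event to the count of all events."""
    return event_count / total_count


rng = random.Random(7)  # seed chosen arbitrarily
heads = 0
for tosses in range(1, 100_001):
    heads += rng.random() < 0.5  # one simulated toss of a fair coin
    if tosses in (10, 100, 1_000, 10_000, 100_000):
        print(f"N_T = {tosses:>6}: N_x / N_T = {estimated_probability(heads, tosses):.4f}")
```

The printed ratios should wander at first and then settle ever closer to 0.5, though no finite experience yields the true value exactly.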

If h is the event of a “head” in a coin-tossing experiment, t is the
alternative event of a “tail,” and we exclude any “coin-on-edge” event
as belonging to an inappropriate experiment, then the “probability of
heads,” p_h, is given by the equation

$$p_h = \frac{N_h}{N_T} = \frac{N_h}{N_h + N_t}.$$

Again, if m designates male sex and f female sex, then the probabilities of
male and female sex at birth as derived from a finite experience involving
N_T births are:

$$p_m = \frac{N_m}{N_T} \quad\text{and}\quad p_f = \frac{N_f}{N_T},$$

wherein N_T = N_m + N_f.

¹ By this is implied a procedure of tossing or shaking which is free of particular
motions or “tricks” calculated to bias the outcome of the experiment.

It is known from extensive experience that, under proper conditions, 
both p_h and p_m approximate to the numerical value of 0.5. The
probability of securing the ace face (or any one specified face) in the random
tossing of dice is commonly believed to be one-sixth, a value that would
be expected on theoretical grounds for unbiased dice. The probability 
of germination of seed when placed in a proper environment is well known 
to depend on the kind and condition of the seed. Because of these fac- 
tors it may vary over a wide range. Seed-testing laboratories are con- 
tinually engaged in experiments to ascertain the percentage germination 
in submitted samples. Such determinations serve as estimates of the 
probability of germination appropriate to each supply of seed from which 
the sample is drawn. 

These so-called probabilities are merely expressions of proportions in 
fractional form. A ratio of frequencies is involved in each case, the 
numerator of the ratio being by definition a part of the denominator. 
Thus the probability scale exists only between the limits of zero and 
unity. Also, each value of p on the scale is a pure number; it has no 
dimensions of measurement. The scale itself is a fundamental one, and 
may be designated for convenience as the standard scale of probability. 

Proportions derived from frequencies are of very common occurrence 
in the analysis of biological data. They form the logical bases of
comparative and interpretive work in many cases. Descriptions of the
relative frequency of occurrence of events play a most important role in the
field of vital statistics. Designated as rates, these numerical expressions 
will receive more extended consideration in the following chapter. At the 
present moment it is of interest to recognize merely that they form a class 
of proportions which may be accommodated to the standard scale of 
probability. 

Although it is convenient in mathematical work to do so, it is not
necessary that probabilities be expressed as fractions on the standard
probability scale. To many minds, simple whole numbers are more
readily comprehended than fractions, and so probabilities are frequently 
quoted colloquially as percentages or as odds, while rates are commonly 
given as per 100, per 1,000, per 10,000, or per 100,000, according to the 
magnitudes in general of the type of proportion under consideration. 
Fundamentally, however, all are simple proportions derived from the 
corresponding magnitudes on the standard scale from zero to unity. 
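The several colloquial forms are simple rescalings of the same fraction. The sketch below is a hypothetical illustration: the function names are invented for the example, and the probability of one-sixth is borrowed from the dice case mentioned earlier. Exact fractions are used so that the odds reduce to whole numbers.

```python
from fractions import Fraction


def as_percentage(p):
    """Probability expressed per 100."""
    return 100 * p


def as_odds(p):
    """Odds in favor, as the reduced ratio p : (1 - p); undefined when p equals 1."""
    ratio = Fraction(p) / (1 - Fraction(p))
    return ratio.numerator, ratio.denominator


def as_rate(p, base=1000):
    """Number of events per `base` occurrences."""
    return p * base


p = Fraction(1, 6)  # e.g. the ace face of an unbiased die
print(f"percentage: {float(as_percentage(p)):.1f}")
print("odds in favor: %d to %d" % as_odds(p))
print(f"rate per 1,000: {float(as_rate(p)):.1f}")
```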

Reference so far has been confined to proportions derived from some 
known experience. It is by no means necessary to restrict our definitions 
in this manner. Probabilities may arise equally well from purely theo- 
retical considerations, or even from arbitrary conjecture. While the 
latter may have but little value, theoretically determined proportions 
based upon sound reasoning embracing all known determinative factors 
may be intrinsically preferable to a value based on rather limited experi- 
ence. In that which immediately follows, we shall use theoretically 
derived probabilities rather freely, on this account. The reasoning will 
be quite general, however, for all proportions, and will be equally appli- 
cable to all situations that are in accord with the fundamental hypoth- 
eses made. 

Let p designate the proportion of favorable events in occurrences
which may be either “favorable” or “adverse,” and let q designate the
proportion of the adverse events. The terms favorable and adverse as
used here are not intended to imply quality, but serve as convenient
designations of simple alternatives; adverse includes all events which are
not to be designated as favorable. Then it follows irrefutably that
p + q = 1 under such circumstances in all experiences. Let it be
assumed now that there is available a large number of perfect “coins,”
each with its center of gravity lying precisely in that plane which bisects
the rim. If the two faces are separately recognizable and designated as
“head” and “tail” faces, then one can think of no logical reason why, in
random tossing, one face should appear more frequently than the other.
Designating one face as favorable, one may theoretically conclude that,
for such coins, p = q = 0.5.²

Consider now the actual random tossing of one such coin. In the
first toss the result would either be one head or one tail. The coin
cannot divide itself. Let the coin be tossed again. As the tossing is repeated
under the ideal conditions of randomness, each toss will be uninfluenced
by what preceded, and the proportions of heads and tails will approach
statistically the limits p = q = 0.5.

² The probability of heads in tossing an ordinary coin is likely to be an irrational
fractional number slightly greater or less than 0.5, depending on the character of
the coin.

Let us follow this assumption to its logical consequences when an
infinitely large group of perfect coins are considered to be repeatedly 
thrown and classified after each throw. It will be convenient for the 
present illustration to write our expectancies in terms of a finite number 
of 32 coins. If these coins are tossed in a perfectly random manner, the
coins then classified according to the uppermost face when they come to 
rest, then thrown again as subgroups, classified to the finer groups, and 
so on, the experiment may be anticipated in its results by the following 
ideal scheme: 


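SCHEME A

Toss   Subgroups of the 32 coins after the toss

1      16h 16t
2      8h 8t 8h 8t
3      4h 4t 4h 4t 4h 4t 4h 4t
4      2h 2t 2h 2t 2h 2t 2h 2t 2h 2t 2h 2t 2h 2t 2h 2t
5      h t h t h t h t h t h t h t h t h t h t h t h t h t h t h t h t

Each subgroup divides equally into heads and tails at the following toss.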


No real experiment with only 32 coins is ever likely to give such an ideal 
division between heads and tails at each toss. However, one wishes to 
represent here in terms of a limited number of coins the proportions that 
would be true if the number was infinitely great. After the fifth toss one 
would theoretically have 32 subdivisions of one coin each, one-half of 
them heads and one-half tails. The present major interest is to determine 
from the diagram the history of each coin in its successive falls. This is 
given in every case by the corresponding channel of the diagram. It may 
be seen, if the rather tedious analysis is undertaken, that no two histories 
are exactly alike, nor is any other type of history possible than is given 
by the diagram. Such sequences of events are known in mathematics as 
permutations. It follows from our ideal diagram that every possible
permutation will occur with equal frequency at any given number of 
tosses. However, if the order in which the head and tail faces appear is 
ignored, then certain combinations will be observed to occur more
frequently than others. If the superscripts to the letters h and t are used
arbitrarily to indicate the number of heads or tails respectively shown in
the history, then one may summarize all the combinations up to and
including each successive toss by series of terms as follows:

SCHEME B

Toss   Combinations and Actual Frequencies Derived from Scheme A

1      $16h + 16t$
2      $8h^2 + 16ht + 8t^2$
3      $4h^3 + 12h^2t + 12ht^2 + 4t^3$
4      $2h^4 + 8h^3t + 12h^2t^2 + 8ht^3 + 2t^4$
5      $1h^5 + 5h^4t + 10h^3t^2 + 10h^2t^3 + 5ht^4 + 1t^5$

Each type of combination is represented above as $h^x t^y$, where x and y are
respectively the number of heads and tails in the combination. The
actual frequencies of occurrence of the different combinations are in each
case the numerical coefficients of $h^x t^y$. Thus $10h^3t^2$ means that, on the
average after the fifth toss, 10 coins out of 32 will show an experience of
having fallen heads up 3 times and tails up twice.
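The frequencies of Scheme B may be recovered by listing every possible history and counting heads, ignoring order. The sketch below is a minimal illustration; the function name `combination_frequencies` is hypothetical, and the 32 coins are assumed to divide ideally among the histories as in Scheme A.

```python
from collections import Counter
from itertools import product


def combination_frequencies(tosses, coins=32):
    """Expected number of coins (out of `coins`) showing each count of heads
    after `tosses` tosses, listed from all heads down to all tails."""
    histories = list(product("ht", repeat=tosses))  # every permutation, equally frequent
    per_history = coins // len(histories)           # coins following each history
    counts = Counter(history.count("h") for history in histories)
    return [counts[heads] * per_history for heads in range(tosses, -1, -1)]


for toss in range(1, 6):
    print(toss, combination_frequencies(toss))
```

Each printed row should reproduce the numerical coefficients of the corresponding line of Scheme B.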

Eliminating the superscripts and numerical coefficients of unity as 
being understood, and dividing the numerical coefficients of the several 
terms for any one toss by the common factor so as to reduce the fre- 
quencies to their simplest numerical proportions, one may write the 
respective combinations with their relative frequencies of occurrence as: 


SCHEME C

Toss   Combinations and Their Relative                      Equivalent Binomial
       Frequencies of Occurrence                            Expression

1      $h + t$                                               $(h + t)^1$
2      $h^2 + 2ht + t^2$                                     $(h + t)^2$
3      $h^3 + 3h^2t + 3ht^2 + t^3$                           $(h + t)^3$
4      $h^4 + 4h^3t + 6h^2t^2 + 4ht^3 + t^4$                 $(h + t)^4$
5      $h^5 + 5h^4t + 10h^3t^2 + 10h^2t^3 + 5ht^4 + t^5$     $(h + t)^5$

The reader will probably recall enough elementary algebra to recognize 
the similarity between these expressions for the combinations and the 
expansions of the binomial expression $(h + t)^n$, wherein n corresponds to
the number of the toss, i.e., the number of events in each group. It may, 
in fact, be construed as the function of the binomial theorem to provide 
a systematic scheme of evaluating the relative number of each of the pos- 
sible combinations of two independent events in multiple association. 

In the foregoing reasoning a factor of the utmost importance has been 
the assumption of independence of the events in any combination. Coins 
may be tossed improperly, or dice so shaken that each result is not wholly
independent of its associates in the combination. Professional gamblers 
are aware of and commonly use devices to interfere with the free play of 
chance. It is desirable for our purposes to have a rigid definition of 
independence, one readily amenable to mathematical test.
Let $p_1$ be the probability of an event and let $p_2$ be the probability of
another event which may occur in combination with the former. Then
the two events are independent of one another if the probability of joint
occurrence of those events equals the product $p_1p_2$ of the probabilities
of their separate occurrence. The truth of this statement, which com- 
prises the usual mathematical definition of independence, may readily be 
made apparent in terms of an actual example. Let there be 3 men to 
every 2 women in a large office force, every member of which is given an 
apple. If 10 per cent of these apples are infested with grubs, and the 
apples are given out indiscriminately, then ideally 10 per cent of the men 
would receive infested apples and also 10 per cent of the women would 
be equally unfortunate. What is the probability that the combined 
event, a woman with a good apple, will occur? Since only 40 per cent 
are women and 90 per cent of the apples are good, 36 women on the 
average out of every 100 employees will receive good apples, if sex and 
goodness are independent; that is, the appropriate probability is 0.36, the 
product of the separate probabilities, 0.4 and 0.9. 
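
The arithmetic of this example may be checked with a short sketch. The staff size of 1,000 is an assumption made only for illustration (the text fixes the 3-to-2 ratio and the 10 per cent infestation, not the size of the office), and the helper name `is_independent` is hypothetical.

```python
from fractions import Fraction

# Probabilities taken from the example in the text.
p_woman = Fraction(2, 5)       # 2 women to every 3 men
p_good = Fraction(9, 10)       # 90 per cent of the apples are sound

# Under independence the joint probability is the product of the two.
p_woman_good = p_woman * p_good

# Hypothetical office of 1,000 employees (the size is an assumption).
staff = 1000
women = staff * p_woman
women_good = women * p_good    # apples handed out without regard to sex


def is_independent(p_joint, p_a, p_b):
    """Return True when the joint probability equals the product p_a * p_b."""
    return p_joint == p_a * p_b


print(float(p_woman_good))
print(is_independent(women_good / staff, p_woman, p_good))
```

Exact fractions are used so that the test of equality is not disturbed by rounding; with observed proportions one would instead ask whether the discrepancy lies within sampling errors.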

Having thrown the concept of independence into this perfectly gen- 
eral form of a mathematical definition, one may proceed to apply it in a 
wide array of problems that may immediately suggest themselves. Returning
to the coin-tossing experiment, if p is the probability of “heads,”
q that of “tails,” then $p \times p \times p \times q \times q = p^3q^2$ is the probability of
the permutation of 3 heads followed by 2 tails in tossing 1 coin 5 times.
But we have seen that there are 10 different permutations, all equally
likely to occur, giving this same combination of 3 heads and 2 tails. Is 10
then the appropriate factor by which to multiply $p^3q^2$ in order to calculate
the probability of the combination of 3 heads and 2 tails?
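
One way to examine the question is by direct enumeration of the histories. The following is a minimal sketch, not part of the original argument; the function names are hypothetical, and the fair coin (p = 0.5) is used only as a convenient check.

```python
from itertools import product


def count_permutations(n, heads):
    """Count the orderings of n tosses containing exactly `heads` heads."""
    return sum(1 for seq in product("ht", repeat=n) if seq.count("h") == heads)


def combination_probability(n, heads, p):
    """Number of permutations times the probability of any one of them."""
    q = 1 - p
    return count_permutations(n, heads) * p**heads * q**(n - heads)


k = count_permutations(5, 3)
print(k)                                    # permutations with 3 heads, 2 tails
print(combination_probability(5, 3, 0.5))   # fair coin
```

Each permutation with 3 heads and 2 tails has the same probability whatever the order, so summing over the permutations is the same as multiplying by their number.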

Let us now proceed to apply the concepts of proportion and independ- 
ence to a complete generalization of the reasoning underlying Scheme A. 
Expectancy being indicated in terms of proportion instead of frequency, 
and the number of events which may only be favorable or adverse being 
infinite, the combinations may be developed as before. This time p and q 
may have any fixed values subject to the limitation p + q = 1, and
instead of tossing coins we may imagine repetitions of independent 
occurrences. Let us assume that a non-contagious, non-lethal, non- 
immunizing disease organism periodically sweeps the earth uniformly, 
and that in each epidemic the proportion of cases of disease developed is 
a constant q. Then p is the proportion of non-affected persons in each 
epidemic. In the first epidemic there will be a proportion q of cases and a 
proportion p of “escapes.” In the second epidemic, since having had the 
disease before is stipulated to be without effect in preventing a second
case, there will again be the proportions q and p of cases and escapes, but
the proportions $p^2$ and $q^2$ will respectively have escaped or developed the
disease twice, while pq and qp proportions will have become cases once 
only. With a third and succeeding epidemic the same type of reasoning 
will hold, leading to the general plan as below, 

SCHEME D

Proportions of Cases (q) and Escapes (p)

[Branching diagram by epidemic not reproduced.]

Since the number of the epidemic is the number of cases and escapes in 
each combination, we have the following generalized summary: 


SCHEME E

No. of Events      Probabilities of Each Combination
in Combination

1                  $p + q = (p + q)^1$

2                  $p^2 + 2pq + q^2 = (p + q)^2$

3                  $p^3 + 3p^2q + 3pq^2 + q^3 = (p + q)^3$

4                  $p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + q^4 = (p + q)^4$

n                  $p^n + np^{n-1}q + \cdots + npq^{n-1} + q^n = (p + q)^n$

Thus the expansion of the binomial $(p + q)^n$ gives directly:

(1) the types of combinations of 2 alternative independent events
in n experiences; and

(2) the numerical value of the probability of each combination
when the values of p, q, and n are substituted in each term and the
product secured.

Summarizing, one may state that, if p is the probability of a favorable 
event, q the probability of its alternatives collectively, and if these events 
are mutually independent in occurrence, then a group occurrence of n 
events of which r are favorable and n — r adverse will have a probability 
of occurrence of $kp^rq^{n-r}$, where k is the appropriate coefficient defined by
the binomial expansion. All one needs is a suitable rule for determining
k in order to apply the reasoning to any practical problem.
The application of this proposition to the solution of problems in
biology has proved most fruitful in testing independence of alternatives 
and in providing a mathematical graduation of actual data. One may be 
interested in such alternatives as survival or non-survival, sex of off- 
spring, health and disease, success or failure, or a host of such problems 
wherein p and q may be hypothecated or determined from actual records. 
For instance, if sex in the human race is determined solely by chance 
factors, then the expansion of the binomial with appropriate values of 
p and q will give the proportions in which families of the various possible
combinations of male and female children will occur in the general popu- 
lation. If p equals the probability of completed normal development of 
the human fetus, q equals that of the alternatives collectively, then the 
evaluation of $6p^5q$ will give the probability of 5 normal births out of 6
pregnancies if abnormalities are purely chance phenomena. 
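
A minimal sketch of such an evaluation follows. The value p = 0.9 is purely hypothetical (the text gives no figure for the probability of normal development), the function name is an assumption, and the coefficient k is taken from Python's `math.comb`, which computes n!/((n - r)! r!).

```python
from math import comb


def group_probability(n, r, p):
    """Probability of r favorable and n - r adverse events among n
    independent events, i.e. k * p**r * q**(n - r)."""
    q = 1 - p
    k = comb(n, r)                # number of orderings giving this combination
    return k * p**r * q**(n - r)


p_normal = 0.9                    # hypothetical value, for illustration only
print(group_probability(6, 5, p_normal))

# The group probabilities over all possible r total to unity.
print(sum(group_probability(6, r, p_normal) for r in range(7)))
```

Any value of p drawn from actual records may be substituted for the hypothetical one; the result holds only so far as the events are independent.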

The expansion of the binomial may be written in general form as 
follows: 

$$(p + q)^n = p^n + np^{n-1}q + \cdots + \frac{n!}{(n-r)!\,r!}\,p^{n-r}q^{r} + \cdots + npq^{n-1} + q^n, \qquad (1)$$


where n! (called factorial n) represents the product of all the integers
from 1 to n. Numerical evaluation of the general term in this
expression will give the probability of occurrence of n - r favorable and
r adverse events in a total group of n events. For those cases where one
wishes to write down the full expansion of a binomial, it is convenient to
remember the following characteristics of the series.


(1) Each term consists of the product of a numerical coefficient, a
power of p, and a power of q.

(2) The first term is always $1p^nq^0 = p^n$.

(3) In succeeding terms the powers of p decrease by unity in
regular order, while those of q increase in like manner, until the final
term, $1p^0q^n = q^n$, is reached.

(4) The product of the numerical coefficient and the power of p in
any term, divided by 1 more than the power of q, gives the numerical
coefficient of the following term. Thus, the numerical coefficient of
the second term will be n, that for the third term will be $\frac{n(n-1)}{2}$,
and so forth.


Following these simple rules, one may write down immediately any ex- 
pansion, for example, 

$$(p + q)^5 = p^5 + 5p^4q + 10p^3q^2 + 10p^2q^3 + 5pq^4 + q^5.$$

Remembering the general term of equation (1) will, however, enable one
to evaluate any single term without having to write out all the preceding 
terms. 
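
Rules (2) to (4) lend themselves to a mechanical procedure. The sketch below is one possible rendering, with a hypothetical function name; it builds each coefficient from its predecessor exactly as rule (4) prescribes.

```python
def binomial_coefficients(n):
    """Coefficients of (p + q)**n built term by term from rule (4)."""
    coeffs = [1]                    # rule (2): first term is 1 * p**n * q**0
    power_p, power_q = n, 0
    while power_p > 0:
        # rule (4): coefficient * power of p / (1 more than the power of q)
        coeffs.append(coeffs[-1] * power_p // (power_q + 1))
        power_p -= 1                # rule (3): powers of p fall by unity
        power_q += 1                # ... while those of q rise by unity
    return coeffs


print(binomial_coefficients(5))
print(len(binomial_coefficients(16)))   # number of terms is n + 1
```

The integer division is exact at every step, since each quotient is itself a binomial coefficient.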

It may be noted that the number of terms in every binomial expansion 
exceeds the power involved by unity. The possible groupings of any 
size n must comprise types in which the favorable event takes values 
from zero to n, i.e., there must be (n + 1) terms. 

Probabilities of group occurrences derived from application of the 
binomial theorem obviously depend in part for their validity on the 
accurate knowledge of the value of p, the probability of the favorable 
event. Determination of the probability in favor of any given event may 
be a matter of theoretical reasoning in some cases, but it is frequently to 
be expected in biological investigations that its value must be based upon 
actual results. A value of p determined from a finite set of observations 
must be expected to be influenced by sampling errors. Another sample 
of the same size from the same general population would not necessarily 
yield exactly the same value of p. Such repeated samples would yield 
similar values of p only to a certain number of significant figures. 

The dependability of p as defined by a finite experience will be 
analyzed when this discussion turns in a later chapter to the sampling 
errors of proportions. One may be content at this point to recognize that 
values of p determined by recorded experience rather than theory are 
trustworthy only to roughly about one-half the number of digits as those 
comprising the size of sample from which they are calculated. In a total 
of 5,017,632 human births considered, Geissler found 2,582,914 to be 
males. From these data one may calculate the proportion of the “favor- 
able” events (males) as 0.514768. Since N here is a seven-figure number 
the derived proportion is, roughly speaking, stable under sampling errors 
to three or four places of decimals. It would seem foolish, therefore, to 
carry more than four figures in using this proportion as a probability of 
male sex at birth in computing binomial evaluations applicable to other 
sets of data. 

Absolutely crucial to the correctness of any probability derived from 
application of the binomial theorem is the necessity that the combinations 
of favorable and adverse occurrences must be determined solely by 
chance. The appearance of each event must be absolutely independent 
of influence from those with which it is associated. Failure of observed 
results to accord (within sampling errors) with expectancy given by the 
binomial theorem must lead to the conclusion that either (a) the value of 
p used was incorrect, (b) the events comprising the groups did not arise
independently of one another, or (c) both the above are true.

DISCRETE FREQUENCY DISTRIBUTIONS FROM THE BINOMIAL

One may pass directly to expected frequency of any definite group 
type in a finite experience involving N groups in all, by multiplying the 
group probability by N. This is merely reversing the procedure of securing
proportions from observed frequencies. The expansion of the binomial
$(p + q)^n$ results in a series of (n + 1) frequencies which correspond
to the possible group combinations. The successive group types
in the series of combinations are completely defined by the number of one
element in the combination. This number proceeds from zero to n (or
from n to zero) and defines the scale of a discrete variable. Therefore the
expansion of $N(p + q)^n$ establishes a discrete frequency distribution
which defines theoretical expectancy for the variable, provided, of course,
that the value of p used is correct and the events are truly independent.
Such distributions have considerable practical value; they also are of 
great theoretical interest. The following tabulation of calculations 
accompanying the solution of the problem indicated will provide a guide 
to the arrangement of work in a binomial expansion giving both group 
probabilities and group frequencies. 

Seeds of Pima cotton were planted in 1,120 hills, each hill being sown
with 6 seeds. After 2 weeks, 3,220 seedlings were found to form the
surviving crop. If germination of the seed and survival of any seedling
are independent of the fate of other individuals with which it is associated
in the hill, what frequency distribution of seedlings per hill would theoretically
be anticipated? Let p be the probability of survival. Then

$$p = \frac{3{,}220}{6{,}720} = 0.47917,$$

and

$$q = 0.52083.$$


Evaluation of the binomial $1{,}120\,(p + q)^6$, p and q having the magnitudes
just given, will provide the desired distribution. This expansion is
given in Table 21, wherein five decimal places are retained in all steps 
leading to the group probabilities. This follows a readily justifiable 
empirical rule that the number of decimal places carried should be 1 more 
than the number of digits in N. The theoretical frequency distribution 
(final column) will then be adequately accurate to whole numbers. It 
should be borne in mind that the distribution so secured is that having the 
p of the observed distribution, and it properly ignores in this phase of 
the work the dependability of p as a basis of estimation appropriate to 
other series.
It should be noted that the theoretical frequencies as calculated must
be expected to embody fractional parts. They are related to one another
strictly in the ratio of the group frequencies in an infinitely large experience.
The fractional parts are often eliminated in practical work for
convenience, but in so doing one is likely to obtain a total differing by 
one or two units from the value of N. This would occur in the present 


TABLE 21

Expansion of the Binomial for Seedlings per Hill

Number of     Theoretical                                             Group          Expected
seedlings     probability                                             probability    frequency
per hill r    formula          $p^r$              $q^{n-r}$           $kp^rq^{n-r}$  $Nkp^rq^{n-r}$

6             $1p^6q^0$        $p^6 = 0.01210$    $q^0 = 1$           0.01210        13.6
5             $6p^5q^1$        $p^5 = 0.02526$    $q^1 = 0.52083$     0.07894        88.4
4             $15p^4q^2$       $p^4 = 0.05272$    $q^2 = 0.27126$     0.21451        240.3
3             $20p^3q^3$       $p^3 = 0.11002$    $q^3 = 0.14128$     0.31087        348.2
2             $15p^2q^4$       $p^2 = 0.22960$    $q^4 = 0.07358$     0.25341        283.8
1             $6p^1q^5$        $p^1 = 0.47917$    $q^5 = 0.03832$     0.11017        123.4
0             $1p^0q^6$        $p^0 = 1$          $q^6 = 0.01996$     0.01996        22.4

Total                                                                 0.99996        1,120.1


illustration if the theoretical frequencies were rounded to the nearest 
whole numbers. Perhaps the simplest solution to this problem is to 
adjust the largest frequencies in order by 1 each so as to obtain the cor- 
rect total; that adjustment introduces the least effect of modification 
error on the results. 
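
The work of Table 21, together with the adjustment just described, may be sketched as follows. The function names are hypothetical, and the code carries full floating-point precision rather than the five decimal places of the table, so that the last printed digit may occasionally differ from the tabled one.

```python
from math import comb

N_HILLS = 1120           # number of hills (groups)
SEEDS_PER_HILL = 6       # n, the number of events in each group
SEEDLINGS = 3220         # survivors, giving p = 0.47917 as in the text

p = SEEDLINGS / (N_HILLS * SEEDS_PER_HILL)


def expected_frequencies(n_groups, n, p):
    """Expected frequency N * k * p**r * q**(n - r) for r = n, ..., 0."""
    q = 1 - p
    return {r: n_groups * comb(n, r) * p**r * q**(n - r)
            for r in range(n, -1, -1)}


def round_to_total(freqs, total):
    """Round to whole numbers, then adjust the largest frequencies in
    order by 1 each until the rounded values sum to `total`."""
    rounded = {r: round(f) for r, f in freqs.items()}
    diff = total - sum(rounded.values())
    step = 1 if diff > 0 else -1
    largest_first = sorted(rounded, key=rounded.get, reverse=True)
    for r in largest_first[:abs(diff)]:
        rounded[r] += step
    return rounded


freqs = expected_frequencies(N_HILLS, SEEDS_PER_HILL, p)
rounded = round_to_total(freqs, N_HILLS)
for r in freqs:
    print(r, round(freqs[r], 1), rounded[r])
```

The adjustment falls on the largest class because a change of one unit there is the smallest relative modification.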

THE BINOMIAL SERIES AND THE NORMAL CURVE 

The discrete frequency distribution arising from the expansion of any
binomial $N(p + q)^n$ may be described in terms of its moments just as in
the case of the continuous variables for which that procedure has previously
been developed. The mean and standard deviation of the expected
number of surviving seedlings per hill given by the binomial in the
foregoing problem are calculated in Table 22. The values $\mu_x = 2.878$ and
$\sigma_x = 1.223$ are secured.³ The frequency distribution may be observed in
the table to be somewhat skew. One might proceed further to secure the
third- and fourth-moment coefficients in like manner. However, all
these computations may be circumvented in favor of very simple relationships
that exist between all the moment coefficients and the values
of p, q, and n.

³ Parametric notation is used because of the theoretical nature of the distribution.

The binomial frequency distribution is given in algebraic form in
Table 23, together with the first and second moments about zero for each 
group. Derivations A and B show that the mean numbers of favorable 


TABLE 22

Calculations for Mean and Standard Deviation of the Theoretical
Expectancy for Seedlings per Hill

x          f          fx         $fx^2$

0          22         0          0
1          123        123        123
2          284        568        1,136
3          349        1,047      3,141
4          240        960        3,840
5          88         440        2,200
6          14         84         504

Total      1,120      3,222      10,944

$\Sigma x = 3{,}222$.      $\mu_x = 2.878$.
$\Sigma x^2 = 10{,}944$.   $\Sigma x^2 / N = 9.771$.
$\sigma_x^2 = 1.495$.      $\sigma_x = 1.223$.


and adverse events are respectively np and nq, and that the standard 
deviation in both cases is $\sqrt{npq}$.
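
These relations may be compared with the arithmetic of Table 22 in a short sketch. The frequencies below are the rounded ones of that table, the class x = 3 being taken as 349, the value consistent with the printed products 1,047 and 3,141; the function name is hypothetical.

```python
from math import sqrt

# Rounded theoretical frequencies of Table 22 (x = seedlings per hill).
table22 = {0: 22, 1: 123, 2: 284, 3: 349, 4: 240, 5: 88, 6: 14}


def mean_and_sd(freq):
    """Mean and standard deviation from the sums of f, fx and fx**2."""
    n_total = sum(freq.values())
    sum_fx = sum(f * x for x, f in freq.items())
    sum_fx2 = sum(f * x * x for x, f in freq.items())
    mean = sum_fx / n_total
    variance = sum_fx2 / n_total - mean**2
    return mean, sqrt(variance)


n, p = 6, 3220 / 6720
q = 1 - p
mean, sd = mean_and_sd(table22)
print(mean, sd)                      # from the tabled frequencies
print(n * p, sqrt(n * p * q))        # from np and sqrt(npq)
```

The small differences between the two lines of output arise from the rounding of the tabled frequencies to whole numbers.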

Derivation A: The mean number of successes is given by dividing
the total of column (3) by N. Column (3) is composed of only n elements
influencing the sum, the first member of the n + 1 entries being
zero. Of these n elements, Nnp is a common factor. When this factor is
withdrawn the remaining series is the binomial expansion of $(p + q)^{n-1}$.
Therefore the total is $Nnp(p + q)^{n-1}$. But $(p + q)^{n-1}$ equals unity, as
any power of p + q does. Thus the mean of the series reduces to

$$\frac{Nnp}{N} = np. \qquad (2)$$


Derivation B: The standard deviation of number of successes per
group is given by the usual equation,

$$\sigma_x = \sqrt{\frac{\Sigma x^2}{N} - \mu_x^2}.$$

Now $\Sigma x^2$ becomes the sum of the products of f and $x^2$ in the table notation.
In arithmetic calculation this is secured by forming the column of
$fx^2$ values. In the algebraic work of Table 23 we use another arrangement.
Since

$$fx^2 = fx(x - 1) + fx,$$

and column (3) gives the fx products, column (4) is arranged to give the
fx(x - 1) products. Thus the sum of column (3) plus the sum of column
(4) yields $\Sigma x^2$ of the definitive equation for the standard deviation. The

TABLE 23

Algebraic Determination of Moments for the Mean and Standard
Deviation of a Binomial Frequency Expansion

(1)          (2)                                       (3)                                        (4)
Number of    Frequency                                 fx                                         fx(x - 1)
successes    f
x

0            $Nq^n$                                    0                                          0
1            $Nnpq^{n-1}$                              $Nnp\,q^{n-1}$                             0
2            $N\frac{n(n-1)}{2}p^2q^{n-2}$             $Nnp\,(n-1)pq^{n-2}$                       $Nn(n-1)p^2\,q^{n-2}$
3            $N\frac{n(n-1)(n-2)}{2(3)}p^3q^{n-3}$     $Nnp\,\frac{(n-1)(n-2)}{2}p^2q^{n-3}$      $Nn(n-1)p^2\,(n-2)pq^{n-3}$
...          ...                                       ...                                        ...
n            $Np^n$                                    $Nnp\,p^{n-1}$                             $Nn(n-1)p^2\,p^{n-2}$

Total        $N(p+q)^n$                                $Nnp(p+q)^{n-1}$                           $Nn(n-1)p^2(p+q)^{n-2}$


reader will see that a factor of $Nn(n-1)p^2$ is common to all products in
column (4). The sum of column (4) reduces to $Nn(n-1)p^2(p+q)^{n-2}$,
which is $Nn(n-1)p^2$, since $(p+q)^{n-2}$ is again unity. Therefore

$$\Sigma x^2 = Nn(n-1)p^2 + Nnp,$$

and

$$\frac{\Sigma x^2}{N} = n(n-1)p^2 + np = n^2p^2 - np^2 + np.$$

Since $\mu_x = np$, then

$$\sigma_x^2 = n^2p^2 - np^2 + np - n^2p^2 = np(1 - p) = npq.$$

That is,

$$\sigma_x = \sqrt{npq}. \qquad (3)$$


When p and q are reversed, the standard deviation will be the same for
the number of failures. 

The third- and fourth-moment coefficients may be derived in like
fashion. The final equations are:

$$\mu_3 = npq(p - q), \qquad (4)$$

and

$$\mu_4 = npq + 3np^2q^2(n - 2), \qquad (5)$$

from which it follows that

$$\beta_1 = \frac{(p - q)^2}{npq}, \quad \text{or} \quad \gamma_1 = \frac{p - q}{\sqrt{npq}}, \qquad (6)$$

and

$$\beta_2 - 3 = \frac{1 - 6pq}{npq} = \gamma_2. \qquad (7)$$

From these values it is plain that, when p is not equal to q, the binomial
distribution is skew, the skewness increasing for any fixed n with increasing
disparity between p and q. However, whatever the values of p and q,
as n increases the skewness and kurtosis both decrease and approach the
limiting values of zero, the normal curve values. This transition of the
discrete binomial series toward the normal curve is a most intriguing one 
theoretically, and of very great importance to the simplification of many 
practical problems involving the binomial expansion. We shall therefore 
reconsider the transition immediately. 
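
The approach of these coefficients to zero may be seen numerically. The sketch below simply evaluates equations (6) and (7), keeping the sign convention of equation (6) as given; the function name and the trial values of n and p are assumptions made for illustration.

```python
from math import sqrt


def binomial_shape(n, p):
    """Return (gamma_1, gamma_2) of the binomial from equations (6) and (7)."""
    q = 1 - p
    gamma1 = (p - q) / sqrt(n * p * q)         # equation (6)
    gamma2 = (1 - 6 * p * q) / (n * p * q)     # equation (7)
    return gamma1, gamma2


# With p fixed, both coefficients shrink toward zero as n increases.
for n in (2, 4, 8, 16, 64, 256):
    g1, g2 = binomial_shape(n, 0.9)
    print(n, round(g1, 4), round(g2, 4))
```

With p = q the skewness is zero for every n, and only the kurtosis remains to vanish as n grows.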

The theoretical range of the discrete binomial distribution is always 
from zero to n; that is, the range increases with n. For any fixed value 
of p, the mean and standard deviation also increase with n. If the dis- 
tributions for increasing values of n are rescaled in terms of relative 
deviates, then the change in form of distribution with p fixed but n 
increasing may be studied geometrically. Under this scheme, as n 
increases, the ordinates defining frequency at the successive points will be 
drawn closer together until, in the limit of n approaching infinite magnitude,
the distribution will become continuous. But, as n → ∞, skewness
and kurtosis of the distribution vanish, so the limiting distribution is
the normal curve. The transition to the normal curve is fairly rapid when
p = q. Indication of this is given in Fig. 41, wherein the binomial
distribution is represented by frequency polygons for n equal successively
to 2, 4, 8, and 16, all distributions being scaled in terms of relative deviates.
The corresponding normal curve is drawn with each series. As the
disparity between p and q increases, so the transition starts from increas- 
ingly skewed forms but moves nevertheless to the same normal curve as 
the limiting form. The skew binomial distributions with p equal to 0.95, 
0.9, and 0.7, and the symmetrical distribution with p = 0.5, are given in 
Fig. 42, wherein n is held constant at 16 and the original scale from zero 
to 16 is preserved. 

FIGURE 41

Transition of the Symmetrical Binomial Series into the Normal
Curve as n Increases

[Figure not reproduced. Horizontal axis: scale of relative deviates.]


Binomials involving large powers are troublesome to expand, not only 
because of the great number of terms, but also because of the small prob- 
abilities which the single terms yield. Since the binomial distribution 
approaches the normal curve as n increases, and this curve is fully tabled, 
it is a great convenience to use the normal curve whenever it provides an
adequate approximation. No general statistical specification should be
given for the adequacy of an approximation, since the importance of the 
error introduced should be judged on non-statistical grounds. However, 
the following empirical rule introduces only slight relative errors and may 
be considered as a first guide. If np (or nq if q is less than p) equals 20 or 
more, the normal curve may replace the binomial distribution for all but 
the most exacting practical problems. Lesser values of np, down to 
np = 10 quite reasonably, may indeed be accepted as usually providing 
adequate approximation. This rule is based upon the degree of approach 
of the binomial beta coefficients to the normal curve values. 

FIGURE 42

Transition of the Binomial Series, Defined by $(p + q)^n$, to Symmetrical Form
when n Is Fixed and p Approaches q



It must be borne in mind that frequency in a continuous distribution
is given by area corresponding to a given range, whereas in the discrete
distribution it is given by the ordinates alone at given points. If frequencies
for binomial terms are to be approximated from corresponding
areas in a normal curve, the range to be used in the normal curve must 
be widened from the binomial point or points by one-half unit on one 
side, or on both sides, as the problem demands. 
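
The half-unit widening may be illustrated by a short sketch. The values n = 40 and p = 0.5 (so that np = 20) and the range of 18 to 22 successes are chosen only for illustration and do not come from the text; the normal-curve area is obtained from the error function in place of a printed table.

```python
from math import comb, erf, sqrt


def normal_cdf(z):
    """Area of the unit normal curve below the relative deviate z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def binomial_point(n, r, p):
    """Exact binomial probability of r favorable events out of n."""
    return comb(n, r) * p**r * (1 - p)**(n - r)


def normal_area(n, r_low, r_high, p):
    """Normal-curve area for r_low..r_high favorable events, the range
    being widened by one-half unit on each side."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    upper = (r_high + 0.5 - mu) / sigma
    lower = (r_low - 0.5 - mu) / sigma
    return normal_cdf(upper) - normal_cdf(lower)


n, p = 40, 0.5                      # illustrative values, np = 20
exact = sum(binomial_point(n, r, p) for r in range(18, 23))
approx = normal_area(n, 18, 22, p)
print(exact, approx)
```

Omitting the half-unit widening would take the area from 18 to 22 only, and so would leave out half of each of the two end classes.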

THE POISSON SERIES 

It has been noted that, when n is small and p is not equal to q, the 
binomial series forms a skew frequency distribution. Just as the normal 
curve provides a very simple basis of approximation to the binomial frequency
distribution under specified conditions, so the skew binomial
which arises under certain other conditions is given satisfactorily by the
exponential series,

$$Ne^{-m}\left(1 + m + \frac{m^2}{2!} + \frac{m^3}{3!} + \cdots + \frac{m^x}{x!} + \cdots\right), \qquad (8)$$

where e is the constant 2.71828, the base of natural logarithms. This
series, first given by Poisson in 1837, is generally designated as the
Poisson exponential series. It may be shown mathematically that the
Poisson exponential series increasingly approximates the binomial series
as p - q approaches unity and n is increased sufficiently that m = np
(or nq, whichever is smaller) remains finite but not necessarily large.
Being much simpler to calculate, the Poisson series is often used in place
of the binomial when (p - q) → 1 and m is, say, between 0.1 and 10.
For a given value of p, m varies directly with n, and as the latter is 
increased the Poisson series, together with the binomial series, approaches 
the normal curve as a limit. 

Like the normal curve, the binomial series has two independent parameters
if N is accepted as unity in both cases. They are n and p, for q
is explicitly defined as 1 - p. For the corresponding normal curve the
parameters are $\mu = np$ and $\sigma = \sqrt{npq}$. The Poisson series, on the other
hand, has but one parameter, m, which is taken as the product np when
the Poisson series is being used as an approximation to the binomial. The
standard deviation of the Poisson series may be shown to be equal to
$\sqrt{m}$; that of the binomial is $\sqrt{mq}$. Thus the binomial can transform
precisely to the Poisson series only in the limit of q → 1. They are, however,
remarkably alike near this limit, and the Poisson series is so much
easier to calculate that it is widely used as an adequate approximation.
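
The closeness of the two series near this limit may be judged from a small comparison. The values n = 100 and p = 0.02 (so that m = np = 2) are illustrative assumptions, as are the function names; probabilities rather than frequencies are computed, N being taken as unity.

```python
from math import comb, exp, factorial, sqrt


def binomial_series(n, p, terms):
    """First `terms` binomial probabilities for x = 0, 1, 2, ... events."""
    q = 1 - p
    return [comb(n, x) * p**x * q**(n - x) for x in range(terms)]


def poisson_series(m, terms):
    """First `terms` Poisson probabilities e**(-m) * m**x / x!."""
    return [exp(-m) * m**x / factorial(x) for x in range(terms)]


n, p = 100, 0.02                 # illustrative values; m = np = 2
m = n * p
binom = binomial_series(n, p, 8)
poisson = poisson_series(m, 8)
for x, (b, s) in enumerate(zip(binom, poisson)):
    print(x, round(b, 4), round(s, 4))

# Standard deviations: sqrt(m) for the Poisson, sqrt(mq) for the binomial.
print(sqrt(m), sqrt(m * (1 - p)))
```

The two standard deviations differ only by the factor $\sqrt{q}$, which is close to unity when p is small.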

It would be quite erroneous to infer from these remarks that the use- 
fulness of the Poisson series is confined to its approximating character to 
the binomial series. Although this exponential series was first reached 
as a limiting form for the binomial, it is available for application to any 
observed discrete distribution starting at zero for which the mean and 
squared standard deviation are sufficiently in agreement. In this con- 
nection it is widely used as a graduating function for discrete distribu- 
tions because of its very simple application. One may illustrate its use 
in this connection in terms of one of the experiments of Student in graduating
the frequency distribution of cell counts with a haemacytometer.⁴
Student used m equal to the observed $\bar{x}$ as his parameter in the Poisson
series. One might choose alternatively to use m equal to $s_x^2$. Since in
the observed series $\bar{x}$ does not equal $s_x^2$, and discrepancy between these
must be expected through errors of random sampling in finite series, the
question arises which is the better one to use. Since the sampling errors
of $\bar{x}$ are considerably less than those of $s_x^2$, the reason for choosing the
former as a definition of m is obvious.

⁴ Student. On the error of counting with a haemacytometer. Biometrika, 5:
351-360. 1907.

In the Poisson series as defined in equation (8), note that the first 

TABLE 24

Actual and Fitted Poisson Series for the Distribution of Number of
Erythrocytes on Haemacytometer Squares

Data from Student⁴

Number of      Observed                     Poisson series
cells per      frequency     m/x            Theoretical      Rounded
square x                                    frequency        frequency *

0              103                          106.6            107
1              143           1.3225         141.0            140
2              98            0.66125        93.2             93
3              42            0.44083        41.1             41
4              8             0.330625       13.7             14
5              4             0.2645         3.6              4
6              2             0.220416       0.8              1

Total          400                          400              400

$\Sigma x = 529$.        $\bar{x} = 1.3225$.
$\Sigma x^2 = 1{,}213$.  $\Sigma x^2 / N = 3.0325$.
$s_x^2 = 1.2835$.        $s_x = 1.1329$.

m for the Poisson expansion taken as equal to $\bar{x}$.
Then $Ne^{-m}$ = antilog of (log N - m log e) = 106.6.


term is $Ne^{-m}$. Each successive term may be derived from its predecessor,
as in Table 24, by multiplying it by $m/x$, x being the scale value of the
variable. These terms total to N, of course. To secure probabilities
instead of frequencies, the first term may be evaluated as $e^{-m}$, from which
point the procedure is the same. Logarithms are usually necessary to
calculate $e^{-m}$, since m rarely proves to be a whole number.
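
The procedure of Table 24 may be sketched as below, using Student's observed frequencies. The function name is hypothetical, and the exponential is evaluated directly with `math.exp` rather than through logarithms, which is an assumption of convenience and not the method of the table.

```python
from math import exp

# Student's observed frequencies for x = 0, 1, ..., 6 cells per square.
observed = [103, 143, 98, 42, 8, 4, 2]


def poisson_frequencies(observed):
    """Fit the Poisson series with m taken as the observed mean."""
    n_total = sum(observed)
    m = sum(x * f for x, f in enumerate(observed)) / n_total
    terms = [n_total * exp(-m)]              # first term, N * e**(-m)
    for x in range(1, len(observed)):
        terms.append(terms[-1] * m / x)      # multiply predecessor by m/x
    return m, terms


m, terms = poisson_frequencies(observed)
print(m)
for x, t in enumerate(terms):
    print(x, observed[x], round(t, 1))
```

Only the classes actually observed are carried; the remaining terms of the infinite series are small, so the seven fitted frequencies fall slightly short of N.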

* Since taking the nearest whole number gives a total frequency of 401, we have dropped 1 from
the highest frequency. Making the first frequency 106 might be preferred by some.

THE PROPORTIONS OF VITAL STATISTICS

One of the many specialized domains of application of the statistical 
method, the field of vital statistics deals with the organization and 
analysis of data pertaining principally to the health, length of life, and 
reproduction of more or less large units of population. Vital statistics 
then assumes an important role in the broad movement aiming to 
describe the well-being and social organization of mankind. Many 
variables of diverse nature enter into consideration in this field of 
inquiry. Some of them, such as age and physical dimensions of man, 
are susceptible of fairly precise determination through measurement. 
The majority of the variables of interest, however, are subject only to 
qualitative description. In the introductory Chapter it was suggested 
that the complexities involved in describing the health of human beings 
must result in attainment only of a rather crude division into a few 
general classes. In contrast to this, sex is determinable with accuracy 
in all but an inappreciable proportion of individuals. With all such 
qualitatively defined variables the description is made by categorical 
designation, the accuracy of the classifications changing from those of 
high degree to others of but vague implication. In general, the individuals
to be described are recognized to fall somewhere in a wide variation
scheme, in which two or more subclasses are definable as parts of the 
range which may be assembled to form the total range of variation. 
Thus an individual may be recognized as being alive or dead with respect 
to the state of vitality; or being in good, medium, or poor health; as 
belonging to one of many well-defined nationalities or several inter- 
mingling races; as being male or female with respect to sex; et cetera. 
Description of individuals in such terms involves the recognition of
qualities innate to them at the time of observation. Such qualities 
may well be designated as attributes of the individuals. The statistical 
analysis of such data pertaining to a population is often called the 
statistics of attributes, to distinguish the type from the statistics of 
measured variables. 

The qualities of the population with which the student of vital 
statistics is concerned are in the main related to the general subjects of 
mortality, morbidity, and natality. The appearance or existence of the
quality may be designated as the event or state. For purposes of terminology,
one may use the term event in a general sense here to include also
those occurrences more properly designated as states. Thus one may 
refer to records of death, sickness, and births as records of events, vital 
statistics being concerned with description of the frequency of occur- 
rence of such events in a comparative way. 

There is not suitable opportunity to concern oneself herein with the
interesting problems of securing the basic data comprising the counts of 
the events and of the populations in which they may take place. Pre- 
liminary reading on this subject may be found in textbooks concerned 
principally with vital statistics, among which may be cited that of Pearl 
entitled “Medical Biometry and Statistics.” Suffice it to say that the
main burden of accumulating such records falls naturally to the lot of 
civil governments, most of which provide special administrative agencies 
to care for this work. The difficulties that face their officials in securing 
full and accurate records are formidable, and in some degree insuperable. 
The element of inaccuracy in the basic data must therefore never be lost 
sight of. Sometimes the data are very crude. On occasion they are so
selectively incomplete or inaccurate that it is questionable whether they 
are better than no information at all, if not worse. In many other situa- 
tions they are highly dependable, and of course all gradations may be 
expected between these extremes. 

The statistical methods of resolution of data, on the other hand, may 
be, and indeed usually are, highly refined. One may pertinently choose 
to give here a simile previously suggested. Like a scalpel in the hands 
of a surgeon, the statistical method depends for the practical effective- 
ness of its work on the skill of the person applying it, not only in his 
knowing how to use the instrument but also in understanding the mate- 
rial to which it is applied. The student of vital statistics must be 
doubly careful to scrutinize his basic data for the accuracy of description 
and coverage involved before applying to them the formal procedures of 
statistical analysis. It is primarily with that method of analysis that 
one is concerned herein. 

RATES 

The vital statistician, and many another engaged in social studies, is 
assigned the task of describing the frequency of occurrence of specified 
events in a population. The description required must be in terms cap- 
able of ready comparison. The task might well be designated as one of 
describing the force of incidence of the event. By way of example one 
may consider the intensity with which death, regardless of the cause, 
impinges on a population. What was the force of mortality in the state 
of Maine in 1930, for instance, and how does it compare with that in the 
state of Montana in the same year? To say that the numbers of deaths 
in the two areas were respectively 11,082 and 5,440 is inadequate 
description, for obviously if the force of mortality was the same the 
deaths would be reasonably proportional to the respective populations, 
other things being equal. This very simple idea of proportionality 
provides immediately a clarification of what is specifically implied by 
the term force of mortality; it is described by the proportion that 
deaths form in a population in a given time period. Such proportions, 
measuring the force of incidence of an event, may be calculated for differ- 
ent types of events wherever the basic data are available. Expressed 
as numbers on scales ranging from zero to some suitable power of 10, 
these proportions are commonly known as rates in vital statistics. 

Two numerical elements are necessary for the determination of true 
proportions convertible to rates. They are: (a) a count of the event of 
reference, and (b) a count of the population in which the event can take 
place. The reader may be asked to note carefully this specification of a 
particular logical relationship between these two counts. A tendency to 
destroy the descriptive value of the ratio will be introduced if the popu- 
lation used is other than the one exposed to risk of occurrence of the 
event. Since death may come to anyone, the whole population is the 
proper base for establishing the general death rate from all causes. In 
measuring the force of mortality through pregnancy, however, only 
those in the pregnant state are exposed to the risk of occurrence of the 
specified event; to use any broader population must introduce an ele- 
ment of confusion into the description. 

RATES AS APPROXIMATING EXPRESSIONS 

The ideal rate describing incidence of an event would result from 
the “follow-up” of a population through a period of exposure to the 
event. Thus, ideally, an annual death rate from all causes would give 
the proportion that deaths formed among a population living at a point 
of time and enumerated with respect to the incidence of death through- 
out the following year. Let $l_x$ be the number of individuals forming the 
living population at a point of time $x$. Let $l_{x+1}$ be the number of sur- 
vivors from that population 1 year later. Then the number of deaths 
$d_x$ in the year of observation would be given by 

$$d_x = l_x - l_{x+1},$$ 

and the proportion of deaths or “per capita death rate” for the year 
would be definable as 

$$\frac{d_x}{l_x}.$$ 

Such a rate, as logically used as a measure of expectancy applicable to 
a comparable population, would measure the probability of death facing 
that population in the ensuing year. 
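
The ideal follow-up rate just defined may be sketched in a few lines of Python. The function name and the figures 100,000 and 98,700 are hypothetical, chosen only for illustration; they are not drawn from any record cited in this chapter.

```python
def per_capita_death_rate(living_at_start, survivors_one_year_later):
    """Return (deaths, proportion of deaths) for a population followed one year.

    Assumes a closed population: nobody enters or leaves except by death.
    """
    if living_at_start <= 0:
        raise ValueError("the starting population must be positive")
    if not 0 <= survivors_one_year_later <= living_at_start:
        raise ValueError("survivors must lie between 0 and the starting population")
    deaths = living_at_start - survivors_one_year_later
    return deaths, deaths / living_at_start


# Hypothetical figures for illustration only.
deaths, rate = per_capita_death_rate(100_000, 98_700)
print(deaths, rate, 1000 * rate)
```

The closed-population assumption in the docstring is what makes this the ideal form; the paragraphs that follow explain why it can rarely be met in practice.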

The securing of records of individual histories such as are called for 
above is obviously out of the question as a civil administrative procedure 
in large populations. The clerical work involved in assembling annual 
reports by all living persons would alone forbid the practice. 

Bearing in mind the practicability of procedures aiming to permit 
description of the vitality characteristics of a population, it might be 
demanded merely that specified events such as births, deaths, and inci- 
dence of readily communicable diseases be reported to civil administra- 
tions. In this connection it may be noted that streams of migration 
continuously permeate civil jurisdictional boundaries, more or less 
steadily changing the population within them. Also, those who die 
cease to be reporters of events to any terrestrial government; it must 
be demanded of others to report on them and on the sick. Thus only 
those births, cases of sickness, and deaths occurring within specified 
areas may at best be expected to be reported. These may take place in 
part among infants and immigrants as new arrivals during any time 
interval, similar events among emigrants from the area being recorded in 
some other jurisdiction. 

The factors mentioned above must be considered in the construction 
of an appropriate mortality rate. The population is changed more or 
less progressively by operation of those factors, and the deaths are 
properly referable only to that changing population. An estimate is 
therefore needed of the average population in the interval. Assuming 
the changes to have taken place fairly uniformly throughout the period, 
the logical population to accept for construction of the mortality rate 
becomes that at the mid-point of the time interval, that is, as of July 1 
if the period is the calendar year. This population must usually be 
estimated from interpolation in, or extrapolation of, the population 
growth curve as indicated by the periodic censuses and accessory bases 
of inference. The general death rates of vital statistics are in practice so 
computed, and other rates fall in the same class. They are logical 
approximations to the ideal figure which is defined by purely theoretical 
considerations. 
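
The mid-year estimate described above may be sketched as a straight-line projection through two census counts. The function name and both census figures below are hypothetical; the sketch also works in days, which is an assumption of convenience, whereas an official estimate might proceed by months or quarters and draw on other bases of inference.

```python
from datetime import date


def estimate_population(census_a, census_b, target):
    """Straight-line estimate through two (date, count) census points.

    Interpolates when target lies between the censuses and extrapolates
    when it lies beyond the later one.
    """
    (date_a, pop_a), (date_b, pop_b) = census_a, census_b
    span_days = (date_b - date_a).days
    if span_days <= 0:
        raise ValueError("census_b must be later than census_a")
    daily_change = (pop_b - pop_a) / span_days
    return pop_b + daily_change * (target - date_b).days


# Hypothetical census counts, for illustration only.
earlier = (date(1920, 1, 1), 500_000)
later = (date(1930, 4, 1), 503_743)
mid_year = estimate_population(earlier, later, date(1930, 7, 1))
print(round(mid_year))
```

A straight line is the simplest growth curve one could assume; where the population is known to have grown unevenly, a curve fitted to more than two censuses would be the more defensible choice.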

The death rate used above, describing the force of mortality regard- 
less of cause as it impinges on a general population, has been discussed 
primarily as a specific illustration of an expression approximating to an 
ideal proportion. Another mortality rate commonly used in medical 
statistics may be contrasted with it from this point of view. In the 
general population, individuals are likely to come under continued 
observation because of some disease that has been contracted by them, or 
because of surgery or other treatment to which they have submitted. 
Such persons may be classified according to the ailment from which they 
suffer, and be followed until recovery or death takes place. The propor- 
tion that deaths from the condition form among the cases is a mortality 
rate determined from recorded histories and therefore accords with the 
foregoing specification of an ideal rate. Commonly known as a case 
fatality rate, it is not an approximating expression in the sense of the 
general death rate previously considered. The case fatality rate may 
be completely accurate for the type of case coming under observation 
provided it may be accepted that death was the consequence of the ail- 
ment prescribed. 

Some conditions which may terminate in death of the patient, and for 
which a rate description of the force of incidence of death is desired, do 
not come under observation in a way permitting of the construction of 
an accurate case fatality rate. An outstanding example is that of 
deaths resulting from the puerperal state. Although cases of pregnancy 
leading to death of the mother are recorded, the total number of preg- 
nancies to which the deaths should properly be related in construction 
of the ideal rate is not readily determinable with accuracy. In lieu of 
this most appropriate denominator for the expression it is customary to 
use the number of live births through the same period of time. One 
need not discuss the degree of approximation involved here beyond 
remarking that the base is accepted as being better than the alternative 
ones which may be generally available in official records. The derived 
rate is known as the maternal mortality rate. Its deficiencies as an 
approximation are compensated for to some degree by fairly dependable 
comparability of the rate from one region to another. 

The three foregoing types of mortality rates must serve to indicate 
that the rates of vital statistics may vary considerably with respect to 
attainment of the ideal form of description given by the proportion that 
a part forms of the whole. Whereas some rates approach closely to 
desired perfection in description, others may approximate the ideal 
expression of proportion rather coarsely. Practical considerations not 
uncommonly forbid the attainment of ideal goals in science. Rates may 
often be constructed from components which do not bear the mathemat- 
ical relationship to one another which would be desired by precision 
objectives. They may nevertheless be logical approximations having 
the quality of comparability in the same way as the unattainable ideal 
ratios of a part to the whole. 

THE PERTINENCE OF RATES TO PLACE AND TIME 

The counts of populations and events incident to them, from which 
descriptive rate expressions may be formulated, pertain to national or 
political subdivisions of the earth's area. Specification of the area of 
reference for any rate is, of course, a necessary part of the description. 
Likewise, definition of the calendar period in the history of the popula- 
tion through which the event is counted is essential to allocation of the 
rate, for most rates have been changing progressively through time. 
Each numerically determined rate must pertain to some area of reference 
and period of historical time. The italicized words in the following 
illustrative statement are essential to interpretation of the numerical 
rates given: The case fatality rate from scarlet fever in the city of Provi- 
dence has fallen from 15.1 per cent in the quinquennium 1884-1888 to 2.7 
per cent in 1918-1922.¹ 

TEMPORAL RATES 

Although all rates pertain to some time period, historically speaking, 
two distinct classes of rates exist with respect to the functional influence 
of length of time involved on the rate itself. This may conveniently be 
illustrated in terms of two of the mortality rates previously discussed, 
the “general death rate” and the “case fatality rate.” The former is the 
ratio of the count of an event through a time interval, to the count of the 
appropriate population at a point of time. Time does not cancel out in 
the ratio; the numerical value of the rate will itself be roughly propor- 
tional to the length of the interval through which the event is counted. 
Such rates should always be adjectively qualified by an expression of the 
length of time interval involved. Thus death rates may be annual, 
quarterly, monthly, et cetera, and their numerical values under constant 
conditions will be proportional to the length of the interval. The annual 
death rate in Minnesota in 1930 was 10.00 per 1,000, whereas that for 
the two months of June and July combined was 1.62 per 1,000, which is 
approximately one-sixth of the annual rate. The population used in 
both periods is identical (that of July 1), whereas the number of deaths 
in each period is very different. 
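
The dependence of such a rate on the length of the interval may be checked with the Minnesota figures just quoted. The function name is hypothetical, and the simple proportional scaling assumes constant conditions through the year, as the text stipulates; it takes no account of seasonal variation.

```python
def annualized_rate(rate_per_1000, months_observed):
    """Scale a temporal rate to a 12-month basis, assuming constant conditions."""
    if months_observed <= 0:
        raise ValueError("months_observed must be positive")
    return rate_per_1000 * 12 / months_observed


annual_rate = 10.00      # Minnesota, 1930, per 1,000 (from the text)
two_month_rate = 1.62    # June and July combined, per 1,000 (from the text)

equivalent_annual = annualized_rate(two_month_rate, 2)
share_of_annual = two_month_rate / annual_rate
print(equivalent_annual, share_of_annual)
```

The two-month rate scales to about 9.7 per 1,000 on an annual basis, close to but not identical with the observed annual rate of 10.00.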

Case fatality rates may be taken as representative of the class of 
ratios in which both numerator and denominator represent counts 
through the same time interval. Under such conditions time cancels 
out to leave the rate as a pure number. The quinquennial case fatality 
rates for scarlet fever just given for the city of Providence may also be 
defined as the average annual rates over the same periods. A five-year 
span was used merely to strike a more representative value for the general 
period. The basic data are given in Table 25, wherein the case fatality 
rates by single years may be compared with those of the five-year periods. 

¹ Vide infra, Table 25. 

TABLE 25 

Reported Cases and Deaths from Scarlet Fever in Two 5-Year Periods 
for the City of Providence, Rhode Island 
(Data from Chapin *) 

Year         Cases    Deaths    Case fatality 
                                rate per 100 

1884           538        57        10.6 
1885           383        38         9.9 
1886           237        30        12.7 
1887           848       153        18.0 
1888           361        79        21.9 
1884-88      2,367       357        15.1 

1918           367        18         4.9 
1919           528        11         2.1 
1920           533        14         2.6 
1921           319         6         1.9 
1922           161         2         1.2 
1918-22      1,908        51         2.7 
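
The rates of Table 25 may be recomputed from its counts of cases and deaths. The following is a minimal sketch; the function and variable names are hypothetical, while the data are those of the table.

```python
# Reported scarlet fever cases and deaths, Providence (Table 25).
CASES = {1884: 538, 1885: 383, 1886: 237, 1887: 848, 1888: 361,
         1918: 367, 1919: 528, 1920: 533, 1921: 319, 1922: 161}
DEATHS = {1884: 57, 1885: 38, 1886: 30, 1887: 153, 1888: 79,
          1918: 18, 1919: 11, 1920: 14, 1921: 6, 1922: 2}


def case_fatality_per_100(deaths, cases):
    """Deaths among recorded cases, expressed per 100 cases."""
    return 100 * deaths / cases


def pooled_rate(first_year, last_year):
    """Case fatality rate for a span of years, pooling all cases and deaths."""
    years = range(first_year, last_year + 1)
    return case_fatality_per_100(sum(DEATHS[y] for y in years),
                                 sum(CASES[y] for y in years))


annual = {year: case_fatality_per_100(DEATHS[year], CASES[year]) for year in CASES}
early = pooled_rate(1884, 1888)
late = pooled_rate(1918, 1922)

for year in sorted(annual):
    print(year, round(annual[year], 1))
print("1884-88", round(early, 1), "1918-22", round(late, 1))
```

Because numerator and denominator are both counted through the same interval, pooling five years changes the stability of the figure but not its scale.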


THE NUMERICAL POPULATION BASE 

Another problem peculiar to the numerical specification of a rate is 
that of the standard size of population referred to. One may conveni- 
ently return to the question of the death rates for 1930 in Maine and 
Montana. The basic data are given in Table 26. 


TABLE 26 

Calculation of General Death Rates, Maine and Montana, 1930 

(1)          (2)           (3)              (4)            (5) 
Area         Deaths in     Population †     Proportion     Death rate 
             1930          July 1, 1930     of deaths      per 1,000 

Maine        11,082        798,140          0.0139         13.9 
Montana       5,440        537,330          0.0101         10.1 


* Annual Reports of the Superintendent of Health of the City of Providence for the years 
1916-1922. Providence: The Oxford Press. 1923. 

† Estimate made by assuming that census figures of January 1, 1920, and April 1, 1930, are 
dependable and that the average three-month population change between the dates continued to 
July 1, 1930. 

Verbal statement of the proportions given in column (4) might lead 
one to say that, for every person in Maine, 0.0139 of a death occurred 
in 1930. The proportion given is wholly analogous to an average, and 
as such its fractional character is not inappropriate. However, a frac- 
tion of a death lacks ready comprehensibility to the lay mind. This 
difficulty is partly circumvented by changing the proportion from a 
“per capita” basis to a “per 1,000” basis, or to any other suitable basis 
yielding whole numbers of deaths. It is in such form that the ratios are 
designated as rates. One or two decimal places are frequently retained 
after the whole numbers even in the rate form when the greater numer- 
ical precision provided by more significant figures is considered desirable. 
The “per 1,000” basis is accepted by convention as a standard for the 
annual rate of deaths from all causes because it yields small whole num- 
bers easy of comprehension. On the same grounds, decennial death 
rates if used might well be given as per 100. Other classes of rates 
require different standard bases to achieve these same ends, as may 
be noted later in this chapter. In the absence of an acceptable standard 
basis common to all rates, it is obviously absolutely essential that the 
basis be defined whenever rates are given in numerical form. 
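
The change of basis described above is a single multiplication, sketched here with the counts of Table 26. The function name and its `base` parameter are hypothetical.

```python
def rate(events, population, base=1000):
    """Express events / population as a rate per `base` persons."""
    if population <= 0:
        raise ValueError("population must be positive")
    return base * events / population


# Deaths in 1930 and estimated July 1 populations (Table 26).
maine_per_capita = rate(11_082, 798_140, base=1)
maine_per_1000 = rate(11_082, 798_140)
montana_per_1000 = rate(5_440, 537_330)

print(round(maine_per_capita, 4), round(maine_per_1000, 1), round(montana_per_1000, 1))
```

The proportion and the rate carry the same information; only the unit of population to which the deaths are referred is changed.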


SPECIFIC RATES 

Why, one may well ask, did the force of mortality in 1930 appear to 
be so much higher in Maine than in Montana? Was it much healthier 
to live in Montana? Acknowledging that this does not seem reasonable, 
one might turn attention to the examination of factors influencing these 
rates, factors that so far have been overlooked in our preliminary con- 
sideration. Were the populations of the two states comparable with 
respect to their composition so far as that may affect the force of mortal- 
ity? The risk of imminent death is well known to be higher for the old 
than for the young. Perhaps there was a greater proportion of older 
people in Maine. The general risk of death may perhaps differ also by 
broad racial groups (white versus negro, for instance), by occupation or 
urbanization, by sex, and so forth. Many such questions arise for con- 
sideration in seeking a solution of discrepancies observed in the com- 
parison of two or more areas or time periods. 

It is of value to notice that all these suggestions emanate from recog- 
nition of the existence of relatively more homogeneous elements into 
which the general population may be divided. Death rates, for instance, 
may be computed within such specific classes of the general population. 
Two or more areas may be compared in terms of these specific class 
rates with greater refinement than is possible when heterogeneity is 
ignored. Such rates for relatively homogeneous subgroups in the popu- 
lation may be designated for purposes of distinction as group specific 
rates. By comparison the rate for the heterogeneous population as a 
whole is called a crude rate, which indeed it is. 

An age specific rate is one calculated for a specified age group in the 
population. There will, of course, be as many of these rates arising for 
one general population as there are age groups recognized in that popu- 
lation. To consider every year of age separately would give a large 
number of classes, larger than is warranted by the change in the risk of 
death. It is because of this that broader classification than by single 
years of age is often used. The coarser grouping chosen is usually an 
irregular one, for experience shows that the risk of death does not change 
uniformly over the life span. This may be seen in the table that follows. 

Within each of such primary groups a further subdivision into racial 
classes, say white and colored, might be made. If there were 10 age 
groups, then there would be 20 age-color classes, 40 age-color-sex classes, 
and so forth. Now an age specific rate is relatively crude compared to 
an age-color specific rate, and so on. Thus a progression in degrees of 
specificity has been established by the concept of refinement in homo- 
geneity of recognizable subclasses of any general population. The 
original crude rate corresponds to what might be called a “zeroth order” 
of specificity, single-factor specific rates being of the first order of speci- 
ficity, two-factor specific rates being of the second order, and so on. 
According to this terminology, the order of specificity of a rate is the 
number of independent criteria of classification which are used to specify 
the ultimate classes into which the population is to be divided. 

Discussion of this subject in terms of actual data must be confined 
herein merely to illustration of a first-order specification. In consider- 
ing the contrast in the Maine and Montana crude death rates above, it is 
reasonably obvious that the most likely source of differentiation which 
would help account for the disparity in the crude death rates would be the 
age composition of the populations. One may proceed to investigate 
this by calculating the age specific death rates. The basic data and 
derived rates are given in Table 27, where a conventional age classifica- 
tion is used. The population estimates by age have been derived by 
assuming that the relative distribution by age at the census date, 
April 1, persisted without change to July 1. 

Comparison of the rates of death from all causes within the specified 
age classes in Table 27 reveals a striking similarity for the two states 
considered. This is in sharp contrast with the disparity between the 
two crude rates given against “All ages” at the foot of the table. Obvi- 
ously, the difference in the crude rates is not due to a greater force of 
mortality in Maine, age for age, but to the difference in age composition 
of the two populations. Recognition of specific age classes within the 
populations has thus contributed greatly to clarification of the basis of an 
initially surprising discrepancy. It has, however, introduced a new 
problem of comprehension in that the single pair of crude rates is now 
replaced by 16 pairs of age specific rates. The large number of differ- 
ences so arising is not readily assimilable as a whole beyond recognition 
of the fact that, in general, they are relatively small. Graphic com- 
parison of the age specific rates by means of two “trend lines” in Fig. 
43 materially aids comprehension of this latter point. However, the 
attendant problem of greater sampling instability of the specific rates, 
because they are necessarily based on smaller numbers of people, arises 
for consideration. The significance of the difference between any pair 
of specific rates may be expected to be of a quite different order from 
that of the difference between the two crude rates. 

TABLE 27 

Age Specific and Total Death Rates in Maine and Montana, 1930 

                         Maine                              Montana 
Age          Estimated    Deaths  Death rate     Estimated    Deaths  Death rate 
            population   in 1930   per 1,000    population   in 1930   per 1,000 

Under 1         14,442     1,227       84.96         9,834       583       59.28 
1               14,594       130        8.91         9,522        81        8.51 
2               15,573        80        5.14         9,832        43        4.37 
3               15,186        49        3.23         9,878        43        4.35 
4               15,365        57        3.71        10,202        32        3.14 
5-9             79,858       148        1.85        53,998        90        1.67 
10-14           74,183       104        1.40        56,402       108        1.91 
15-19           68,796       153        2.22        50,141       146        2.91 
20-24           60,674       224        3.69        43,758       166        3.79 
25-29           53,197       183        3.44        38,199       171        4.48 
30-34           52,700       230        4.36        35,480       152        4.28 
35-44          101,358       552        5.45        82,696       534        6.46 
45-54           90,495       980       10.83        63,476       702       11.06 
55-64           72,597     1,476       20.33        37,209       783       21.04 
65-74           46,690     2,433       52.11        20,236       959       47.39 
75 and over     22,432     3,056      136.23         6,467       847      130.97 

All ages       798,140    11,082       13.88       537,330     5,440       10.12 
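
The age specific and crude rates of Table 27 may be recomputed from its populations and deaths. The sketch below is illustrative and its names are hypothetical. Two population entries (Maine, 55-64, and Montana, 75 and over) are taken at the values that agree with the printed column totals, with the printed rates, and with Table 28.

```python
AGE_GROUPS = ["<1", "1", "2", "3", "4", "5-9", "10-14", "15-19", "20-24",
              "25-29", "30-34", "35-44", "45-54", "55-64", "65-74", "75+"]

# Estimated July 1, 1930 populations and deaths in 1930 (Table 27).
MAINE_POP = [14442, 14594, 15573, 15186, 15365, 79858, 74183, 68796,
             60674, 53197, 52700, 101358, 90495, 72597, 46690, 22432]
MAINE_DEATHS = [1227, 130, 80, 49, 57, 148, 104, 153,
                224, 183, 230, 552, 980, 1476, 2433, 3056]
MONTANA_POP = [9834, 9522, 9832, 9878, 10202, 53998, 56402, 50141,
               43758, 38199, 35480, 82696, 63476, 37209, 20236, 6467]
MONTANA_DEATHS = [583, 81, 43, 43, 32, 90, 108, 146,
                  166, 171, 152, 534, 702, 783, 959, 847]


def specific_rates_per_1000(deaths, population):
    """One rate per age group: deaths in the group per 1,000 of the group."""
    return [1000 * d / p for d, p in zip(deaths, population)]


def crude_rate_per_1000(deaths, population):
    """Rate for the whole population, ignoring its age composition."""
    return 1000 * sum(deaths) / sum(population)


maine_rates = specific_rates_per_1000(MAINE_DEATHS, MAINE_POP)
montana_rates = specific_rates_per_1000(MONTANA_DEATHS, MONTANA_POP)
maine_crude = crude_rate_per_1000(MAINE_DEATHS, MAINE_POP)
montana_crude = crude_rate_per_1000(MONTANA_DEATHS, MONTANA_POP)

for age, m, t in zip(AGE_GROUPS, maine_rates, montana_rates):
    print(f"{age:>6} {m:8.2f} {t:8.2f}")
print(f"{'all':>6} {maine_crude:8.2f} {montana_crude:8.2f}")
```

The crude rate is the weighted mean of the specific rates, the weights being each state's own age distribution; that is why two states with similar specific rates can show very different crude rates.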

CORRECTED RATES 

There will, in general, be as many specific rate comparisons as there 
are subclasses designated. In the foregoing illustration the broad age 
groupings in the regions where the rates are not changing markedly ¹ are 
desirable because they give greater sampling stability to the rates 
secured. Where the total populations are not large, or the number of 
specific classes is considerable, the populations per specific class may be 
so reduced as to render some or all of the specific rates of questionable 
value, whereas the crude rates may have been fairly dependable from the 
point of view of sampling errors. In view of such possibilities, if not 
from other bases of reasoning, one may naturally raise the question: 
What would the total rates have been if the populations had not differed 
in composition with respect to the character made specific? 

FIGURE 43 

Crude, Age Specific, and Age Corrected Death Rates. Mortality from All 
Causes in Maine and Montana, 1930 

Despite apparent reasonableness of this question on first considera- 
tion, careful analysis readily shows that it is unanswerable without the 
use of populations other than the true ones in which the deaths took 
place. Answers to the question must tend to carry one away from 
reality, for hypothetical populations must be invoked. The accept- 
ability of the transfer will, of course, depend on the usefulness of the 
information which the answer yields. 

¹ The very broad group, 75 and over, is imposed by lack of more detailed published data 
on age at death in this range. 

The basic inquiry in constructing corrected rates is to ascertain the 
total number of deaths that would have resulted if the group specific 
rates for each observed population had prevailed on some other popula- 
tion used as a standard. Such standard populations may be structurally 
theoretical but derived from actual population counts of past censuses, or 
the actual populations themselves may be used. Also, those populations 
may, on the one hand, be quite independent of, or on the other hand 
derived from, the ones of reference in the specific rate correction prob- 
lem undertaken. The principle of keeping as close to reality in selection 
of the standard population as conditions will permit has much to com- 
mend it. To return to the data on deaths in Maine and Montana, this 
principle may be observed in either of two ways: 

(a) The age specific rates of one state may be applied to the popu- 
lation of the other to determine an “expected number of deaths” 
for comparison with those actually occurring. This procedure 
obviously will be possible in two directions if there are but two states, 
or in n(n - 1) directions if there are n states, yielding much con- 
fusion if n is large. 

(b) The age specific rates of each state may be applied to the 
combined population of both states to secure “expected deaths.” 
With n states there will be only n such sets of “expected deaths.” 

When the general principle is considered, rather than the anticipated 
comparison of only two areas, method (b) above would be selected as 
the more satisfactory. It is applied to the comparison of Maine and 
Montana in Table 28, wherein the last two columns, calculated from 
the preceding three columns, give the “expected number of deaths” if 
the force of mortality by age characteristic of each state had afflicted 
the population of the two states combined. The total numbers of 
“expected deaths” are found to be 16,690 and 16,104 respectively, 
derived from the age specific rates of Maine and Montana. This 
rather close agreement in the pair of comparable figures stresses the 
very similar force of mortality that characterized the two areas when 
the age composition of their populations is accounted for. The number 
of deaths may in each case be converted to a rate by dividing by the 
total population of 1,335,470 in which they are calculated to occur. 
These total death rates, determined as 12.50 and 12.06 per 1,000 respec- 
tively, are known as age corrected (or age adjusted) rates. 

It is unfortunately all too easy to overlook the hypothetical nature 
of corrected rates. To state from the calculations that 12.50 per 1,000 
is the age corrected general death rate for Maine in 1930 is quite erron- 
eous. The population used in arriving at the corrected rate was that of 
Maine and Montana combined. Other populations might have been 
chosen, yielding perhaps substantially different figures. This is illus- 
trated in Table 29, wherein the theoretical life table population of 1910 
for the original registration states is used as a standard. The age cor- 
rected general death rates therein derived from the Maine and Montana 
age specific rates are 14.98 and 14.43 per 1,000, figures substantially 
higher than those secured from Table 28. 

TABLE 28 

Calculation of Age Corrected Total Death Rates for the Comparison of 
Maine and Montana, 1930 

                                                 Observed death      Expected deaths * 
                      Population                 rates per 1,000     under rates of 
Age              Maine   Montana       Total     Maine   Montana     Maine   Montana 

Under 1         14,442     9,834      24,276     84.96     59.28     2,062     1,439 
1               14,594     9,522      24,116      8.91      8.51       215       205 
2               15,573     9,832      25,405      5.14      4.37       131       111 
3               15,186     9,878      25,064      3.23      4.35        81       109 
4               15,365    10,202      25,567      3.71      3.14        95        80 
5-9             79,858    53,998     133,856      1.85      1.67       248       224 
10-14           74,183    56,402     130,585      1.40      1.91       183       249 
15-19           68,796    50,141     118,937      2.22      2.91       264       346 
20-24           60,674    43,758     104,432      3.69      3.79       385       396 
25-29           53,197    38,199      91,396      3.44      4.48       314       409 
30-34           52,700    35,480      88,180      4.36      4.28       384       377 
35-44          101,358    82,696     184,054      5.45      6.46     1,003     1,189 
45-54           90,495    63,476     153,971     10.83     11.06     1,668     1,703 
55-64           72,597    37,209     109,806     20.33     21.04     2,232     2,310 
65-74           46,690    20,236      66,926     52.11     47.39     3,488     3,172 
75 and over     22,432     6,467      28,899    136.23    130.97     3,937     3,785 

Total          798,140   537,330   1,335,470     13.88     10.12    16,690    16,104 

Age corrected total death rate per 1,000                             12.50     12.06 

* In the total population. 
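
The calculation of Table 28 may be sketched as follows, applying each state's age specific rates to the combined population as the standard. The function and variable names are hypothetical. The rates are entered as printed, to two decimals, so the totals may differ by a death or so from those obtained by summing the rounded entries of the table.

```python
# Observed age specific death rates per 1,000 (Table 28), youngest group first.
MAINE_RATES = [84.96, 8.91, 5.14, 3.23, 3.71, 1.85, 1.40, 2.22,
               3.69, 3.44, 4.36, 5.45, 10.83, 20.33, 52.11, 136.23]
MONTANA_RATES = [59.28, 8.51, 4.37, 4.35, 3.14, 1.67, 1.91, 2.91,
                 3.79, 4.48, 4.28, 6.46, 11.06, 21.04, 47.39, 130.97]

# Standard population: Maine and Montana combined, by age (Table 28).
COMBINED_POP = [24276, 24116, 25405, 25064, 25567, 133856, 130585, 118937,
                104432, 91396, 88180, 184054, 153971, 109806, 66926, 28899]


def age_corrected_rate(specific_rates_per_1000, standard_population):
    """Return (expected deaths, corrected rate per 1,000) on a standard population."""
    if len(specific_rates_per_1000) != len(standard_population):
        raise ValueError("rates and standard population must have equal length")
    expected = sum(r * p / 1000
                   for r, p in zip(specific_rates_per_1000, standard_population))
    return expected, 1000 * expected / sum(standard_population)


maine_expected, maine_rate = age_corrected_rate(MAINE_RATES, COMBINED_POP)
montana_expected, montana_rate = age_corrected_rate(MONTANA_RATES, COMBINED_POP)

print(round(maine_expected), round(maine_rate, 2))
print(round(montana_expected), round(montana_rate, 2))
```

Replacing `COMBINED_POP` by any other standard population of the same age grouping changes the level of both corrected rates while leaving the method untouched.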
The numerical value of any corrected rate is not descriptive of any 
occurrence. The meaningful value is the relative magnitude of that 
rate when compared to some other corrected rate calculated on the same 
population. The ratios of the age corrected death rates for Maine and 
Montana derived from the two standard populations are given in Table 
30. These ratios are essentially identical despite the discrepancies in the 
age corrected rates provided by the two populations. The important 
feature revealed by the new rates is that the force of mortality for Maine, 
when adjusted for age, is only 4 per cent greater than that appropriate 
to Montana, whereas the crude death rate for Maine was nearly 40 per 
cent greater than for Montana. 

TABLE 29 

Calculation of Age Corrected Total Death Rates for the Comparison of 
Maine and Montana, 1930 

              Observed death rates                  Expected deaths † under 
                   per 1,000          Life table      age specific rates of 
Age              Maine   Montana    population *       Maine     Montana 

Under 1          84.96     59.28        17,841         1,516       1,058 
1                 8.91      8.51        16,916           151         144 
2                 5.14      4.37        16,612            85          73 
3                 3.23      4.35        16,448            53          72 
4                 3.71      3.14        16,338            61          51 
5-9               1.85      1.67        80,682           149         135 
10-14             1.40      1.91        79,628           111         152 
15-19             2.22      2.91        78,513           174         228 
20-24             3.69      3.79        76,802           283         291 
25-29             3.44      4.48        74,717           257         335 
30-34             4.36      4.28        72,342           315         310 
35-44             5.45      6.46       135,917           741         878 
45-54            10.83     11.06       121,068         1,311       1,339 
55-64            20.33     21.04        98,831         2,009       2,079 
65-74            52.11     47.39        65,377         3,407       3,098 
75 and over     136.23    130.97        31,968         4,355       4,187 

Total            13.88     10.12     1,000,000        14,978      14,430 

Age corrected total death rate per 1,000               14.98       14.43 

* U. S. Original Registration States, 1910 Life Table. 
† In the life table population. 
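
The ratios reported in Table 30 follow from the crude and age corrected rates already obtained. The short sketch below, with hypothetical names, recomputes them from the printed rates.

```python
# Rates per 1,000 as printed in Tables 27, 28, and 29.
CRUDE = {"Maine": 13.88, "Montana": 10.12}
CORRECTED = {
    "combined actual": {"Maine": 12.50, "Montana": 12.06},
    "life table": {"Maine": 14.98, "Montana": 14.43},
}


def maine_to_montana(rates):
    """Ratio of the Maine rate to the Montana rate."""
    return rates["Maine"] / rates["Montana"]


crude_ratio = maine_to_montana(CRUDE)
corrected_ratios = {name: maine_to_montana(r) for name, r in CORRECTED.items()}

print(round(crude_ratio, 2))
for name, ratio in corrected_ratios.items():
    print(name, round(ratio, 2))
```

The two standards yield corrected rates of different level but very nearly the same ratio, which is the sense in which only the relative magnitude of a corrected rate is meaningful.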

TABLE 30 

Comparison of Age Corrected Total Death Rates, 
Maine and Montana, 1930 

                                 Standard population 
                           Combined actual    Life table * 

Maine                           12.50            14.98 
Montana                         12.06            14.43 
Ratio, Maine to Montana          1.04             1.04 

* U. S. Original Registration States, 1910 Life Table. 

RATES SPECIFIC FOR SUBDIVISIONS OF THE EVENT 

The division of populations into more homogeneous subclasses 
often provides a crucial refinement in formulating rate comparisons be- 
tween areas or points in time. Perhaps there is no more striking illus- 
tration of this than flows from the widely varying age specific death 
rates, ranging from less than 2 per 1,000 about 10 years of age to a 
theoretical limiting value of 1,000 per 1,000 at the end of the human 
span of life. In introducing rate expressions, death from any cause is a 
convenient event to choose because existence of the state of death is 
rarely open to controversy and it is an event to which all are exposed. 
This particular event and its companion morbidity are, however, con- 
ventionally subject to subclassification according to some so-called 
“cause” of the death or sickness. We must forego here any considera- 
tion of the arbitrary nature of such subclassifications with their fre- 
quently great errors of judgment, and concern ourselves solely with 
recognition of a new category of specific rates, namely, those termed 
cause specific rates. 

There is no difficulty in accepting cause specific rates as a descriptive 
type. A problem more or less peculiar to them, however, is to give 
them suitable numerical expression. This arises from the widely vary- 
ing proportions which they take on a “per capita” basis. Some causes 
of sickness or death are common; others are extremely rare; and no one 
standard size of population is eminently suitable as a rate basis for all. 
In only very few cases, for example heart disease, or cancer and tumors 
collectively, is a population unit of 1,000 a suitable basis yielding small 
whole numbers for such rates, the point being that although large num- 
bers are exposed relatively few are smitten. Accordingly, convention 
for some time stipulated 100,000 as the base for numerical expression of 
cause specific rates. 

Increasing control over certain diseases is already making the “per 
100,000” too small a base where before it was adequate. A quarter of a 
century ago the annual death rate from typhoid fever in Minnesota was 
approximately 11 per 100,000, but for the last ten years it has been 
less than 1 per 100,000. Thus some cause specific death rates need to be 
quoted as “per million” or a higher power of 10 to preserve whole num- 
bers in the rates. One must always be alert to ascertain the numerical 
population base on which cause specific rates are quoted. One should 
also preserve a wholesome regard for the effects which differences of 
nomenclature or classification procedure, frank or illusive errors of 
judgment, and changing pressure of medical opinions and public reac- 
tion may have on the comparability of cause specific mortality or mor- 
bidity rates between different areas or points of time. 

SOME WIDELY USED RATES DEFINED 

In closing these general comments on the proportions of vital sta- 
tistics it is perhaps pertinent to remark that the diverse expressions of 
rate have been developed over a span of years without any organized 
plan of nomenclature or critical review of the different types which have 
come into existence. Probably the most important matters for the 
statistician to recognize in dealing with these descriptions are those listed 
below. 

(1) They all correspond to simple proportions and, in statistical 
analysis procedure, fall directly in that class of measures. 

(2) Their magnitudes are dependent on multiplying a numerical 
magnitude on the standard probability scale by some power of 10, 
that power being more or less conventionally set up for the particular 
type of rate in question. Without a clear statement of this mul- 
tiplier, which is simply a scale transformation factor, rate expressions 
are useless. 

(3) Many rates (but not all) are a function of time in the sense 
that the magnitude of the rate is proportional to the length of the 
time interval. Such rates to be meaningful must have an accom- 
panying statement of the time interval involved. 

(4) Most rates of vital statistics are expressions approximating to 
some ideal ratio which is not itself readily determinable. Under- 
standing of the nature of this approximation is essential to thorough 
appreciation of the descriptive quality and usefulness of the rate 
involved, when comparative work is undertaken. 

Some of the more commonly used rates are listed here for reference. 

Death rate: the number of deaths irrespective of cause in a time interval, 
per 1,000 population at the mid-point of the interval. 

Cause specific death rate: the number of deaths ascribed to the specific cause in a 
time interval, per 100,000 population at the mid-point of the interval. 

Case fatality rate: the number of deaths from a specific disease, per 100 reported 
cases of that disease, both counts being made through the same specified 
interval. 

Maternal mortality rate: the number of deaths directly attributable to preg- 
nancy, per 1,000 live births, both counts being made in the same specified 
interval. 

Infant mortality rate: the number of deaths in the first year of life, per 1,000 live 
births, both counts being made in the same specified time interval. 

The neonatal mortality rate similarly considers deaths within the first month 
of life. 

Birth rate: the number of live births in a specified interval, per 1,000 population 
at the mid-point of the interval. 

This very crude rate is often refined to include in the denominator only 
women who are in the reproductive years of life (15 to 45, or 10 to 60). 

Subdivision may be made for legitimate and illegitimate births, with 
appropriate denominators. 

Stillbirth rate: the number of stillbirths, per 100 live births, both counts being 
made through the same specified time interval. 

Morbidity rate: the number of cases of disease, in a given time interval, per 
1,000 or per 100,000 population at the mid-point of the interval. 

This rate is usually cause specific. 
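
The definitions above all reduce to a simple proportion multiplied by a stated power of 10. A minimal sketch of that arithmetic follows; the function names `rate` and `rebase` are illustrative assumptions, not part of the text, and the figures are those quoted in this book for Maine (expected deaths and population), the Mankato typhoid outbreak, and the former Minnesota typhoid death rate.

```python
def rate(events, exposed, base):
    """Proportion events/exposed expressed per `base` (a power of 10)."""
    return base * events / exposed


def rebase(value, old_base, new_base):
    """Re-express a rate quoted per `old_base` as a rate per `new_base`."""
    return value * new_base / old_base


# Age corrected deaths and population for Maine, per 1,000 population.
maine_death_rate = rate(9_975, 798_140, 1_000)

# Case fatality in the Mankato outbreak, per 100 reported cases.
mankato_case_fatality = rate(35, 511, 100)

# A typhoid death rate of 11 per 100,000 restated per million.
typhoid_per_million = rebase(11, 100_000, 1_000_000)

print(f"{maine_death_rate:.2f} per 1,000")
print(f"{mankato_case_fatality:.2f} per 100")
print(f"{typhoid_per_million:.0f} per 1,000,000")
```

The base is passed explicitly in every call because, as remarked above, a rate quoted without its multiplier is useless.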



CHAPTER 14 


SAMPLING ERRORS OF PROPORTIONS 

It is a generalization of fairly wide acceptance that 10 per cent of all 
cases of typhoid fever may be expected under present conditions to 
terminate in death of the patient. This anticipation is based on consider- 
able experience with reported cases of the disease, the case fatality rate 
derived from “follow-up” studies in large areas appearing in the last 
decade or two to have remained essentially unchanged. With a view to 
studying the sampling distribution of proportions in terms of familiar 
situations and actual numbers one may for present purposes accept this 
proportion of case fatalities as being correct. 

Consider now 1,000 sets of 10 cases each of typhoid fever, chosen at 
random from a supposedly infinitely large supply. One may proceed on 
the basis of the binomial theorem to determine how often the number of 
recoveries per group of 10 cases must be expected to be 10, 9, 8, ..., zero. 
Let the probability π of recovery be accepted as 0.9; then the probability 
of death, 1 − π, becomes 0.1. Expansion of the binomial 

$$1{,}000\,[\pi + (1 - \pi)]^{10}$$

will give the expected numbers of groups with 10, 9, 8, et cetera, recov- 
eries per group. The arithmetic expansion is given in Table 31, 
wherein it may be noted that roughly one-third of the experiences with 
sets of 10 cases will show recoveries without any deaths, one-third with 
one death, and one-third with 2 or more deaths. These results are in- 
evitable consequences flowing from the assumption that each set of 10 
cases is assembled by random selection from a general supply having a 
case fatality rate of 10 per cent. 
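
The expansion just described may be carried out mechanically. The sketch below is offered only as an illustration: the helper `binomial_terms` is a hypothetical name, and the code simply evaluates each term of the binomial with π = 0.9 and n = 10 using the standard library.

```python
from math import comb


def binomial_terms(n, pi):
    """Probability of k 'favorable events' in n trials, for k = n down to 0."""
    return {k: comb(n, k) * pi**k * (1 - pi) ** (n - k) for k in range(n, -1, -1)}


n, pi, sets = 10, 0.9, 1_000
terms = binomial_terms(n, pi)

print("recoveries  probability  per 1,000 sets  per cent recovery")
for k, prob in terms.items():
    print(f"{k:10d}  {prob:11.6f}  {sets * prob:14.2f}  {100 * k / n:17.0f}")
```

The unrounded frequencies per 1,000 sets are printed to two decimals, so that any adjustment of whole numbers needed to reach a total of exactly 1,000 remains visible.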

One may direct attention now to these sets of 10 cases each. The 
number of recoveries per set, taken as a proportion of the number in the 
set, defines the probability of recovery so far as each set alone is capable 
of defining that probability. These probabilities p, based on finite ex- 
perience, are given in the final column of Table 31. 

The reader will now discern the reason for using π as the probability
of recovery in the original expansion. It is the assumed true or para-
metric value for the probability of recovery, the values p given by the
samples of 10 being corresponding statistics derived from finite experi-
ence. The p values will be observed to differ from one another solely
through errors of random sampling, for by definition the individuals com-
prising the samples were drawn at random. Thus the errors of random
sampling in observed proportions, p, are explicitly defined by the bi-
nomial expansion based on the true proportion, π, of infinitely large
experience. For samples of size 10, one may expect once or twice in
1,000 sets of 10 cases that half the patients will die when the true proba-
bility of death is only 10 per cent. For the same reasons slightly more
than one-third of all samples will show no deaths at all.

TABLE 31

Determination of the Expected Numbers of Recoveries in Sets of 10 Cases
of Typhoid Fever
Probability of recovery, π = 0.9

Recoveries       Group          Frequency per       Recovery indicated
per 10 cases     probability    1,000 sets of       by the set,
                                10 cases            in per cent
    10           0.348678           349                   100
     9           0.387420           388 *                  90
     8           0.193710           194                    80
     7           0.057396            57                    70
     6           0.011160            11                    60
     5           0.001488             1                    50
     4           0.000138
     3           0.000009
     2           0.000001
     1           0.000000
     0           0.000000
  Totals         1.000000         1,000

* Point of least relative error of modification to give total of 1,000.

It has already been proved¹ that the mean number of "favorable
events" in a binomial distribution is np, the standard deviation being
√(npq). Changing p to π to distinguish the parameter from the equiva-
lent sample statistic, and dividing by the number of events per group
to reach the probability scale from the frequency scale, it follows that
the equations

$$\mu_p = \pi, \tag{1}$$

and

$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} \tag{2}$$

define the mean and standard deviation of p based on random samples
of size n. When the size of sample is small, the individual terms of this
discrete distribution may be evaluated without difficulty. When n is
large, it has been shown in Chapter 12, the normal curve defined by
equations (1) and (2) above will give adequate approximation to the dis-
tribution probabilities. The problem of sampling errors in probabilities
has therefore been rather completely analyzed in these pages, the needed
theoretical distributions being established.

¹ Vide page 176 et seq.
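
Equations (1) and (2) may be checked numerically against the discrete distribution itself. The following sketch, given as an illustration under the typhoid assumptions used above (n = 10, π = 0.9), computes the mean and standard deviation of p directly from the binomial terms and compares them with the formulas; all variable names are illustrative.

```python
from math import comb, sqrt

n, pi = 10, 0.9

# Probability of each possible number of recoveries k out of n.
probs = {k: comb(n, k) * pi**k * (1 - pi) ** (n - k) for k in range(n + 1)}

# Moments of the sample proportion p = k / n, taken over the whole distribution.
mean_p = sum((k / n) * w for k, w in probs.items())
sd_p = sqrt(sum(((k / n) - mean_p) ** 2 * w for k, w in probs.items()))

# The same quantities from equations (1) and (2).
formula_mean = pi
formula_sd = sqrt(pi * (1 - pi) / n)

print(f"mean of p: {mean_p:.6f} (formula {formula_mean:.6f})")
print(f"s.d. of p: {sd_p:.6f} (formula {formula_sd:.6f})")
```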

THE STANDARD ERRORS OF PROPORTIONS AND RATES 

The true standard deviation of the sampling errors in proportions p is
defined in equation (2) above as being dependent on the value of π in the
supply. Without knowledge of this value it is impossible to define that
standard deviation precisely. One is therefore obliged to estimate π
when it is desired to attach to the value of p yielded by a unique sample
some measure of its sampling stability. Equation (1) above is of interest
in this connection in that it shows that p is an unbiased statistic. On the
average, p is equal to π. The best estimate of π given by a unique
sample is therefore the sample value of p. Substituting this in equation
(2), one secures an estimate of σ_p, which estimate must appropriately
bear designation as the standard error of p:

$$\mathrm{S.E.}_p = \sqrt{\frac{p(1 - p)}{n}}. \tag{3}$$

Rates have been defined as true proportions or logical estimates of
proportions expressed on a scale from zero to some specified power of 10.
One might well designate the upper limit of that scale as the numerical
population base for the rate. Let B be that base and R the rate. Then
the basic proportion of the standard probability scale leading to the rate is

$$p = \frac{R}{B},$$

the rate being equal to Bp. It follows directly then that the standard
error of any rate is defined by the equation

$$\mathrm{S.E.}_R = B\sqrt{\frac{p(1 - p)}{n}}.$$

Approximate confidence intervals with respect to π may readily be
established on the basis of equation (3) provided that n is large enough to
warrant use of the normal curve for the sampling distribution of p. The
situation then becomes identical with that wherein a confidence interval
is to be defined for the mean of a supply, starting from a sample mean and
its standard error. The 95 per cent confidence interval for π will cover
the approximate range from (p − 1.96 S.E._p) to (p + 1.96 S.E._p). Other
confidence intervals may be established in analogous manner.

A practical illustration of the sampling trustworthiness of a propor-
tion may be given in terms of the very large series of records of sex at
birth accumulated by Geissler.² He found that among 5,017,632 human
births the number of male offspring was 2,468,305. If p is the derived
probability of male sex at birth, then

p = 0.514768,
and q = 0.485232.

From this one determines that

$$\mathrm{S.E.}_p = \sqrt{\frac{0.514768 \times 0.485232}{5{,}017{,}632}} = 0.000223.$$

The 95 per cent confidence interval for locating the probability of male
sex at birth that would be given by an infinitely large experience is there-
fore from 0.5143 to 0.5152. It would thus seem inadvisable to use the
given p to more than three or four places of decimals in discussions of the
probability in general of male sex at birth.
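
The standard error and the approximate confidence interval may be sketched in a few lines. The function names below are assumptions made for illustration; the inputs are the p and n quoted above for Geissler's series, and, for the rate, the pooled Mankato case fatality rate of 6.85 per 100 applied to the 221 male cases of Table 33.

```python
from math import sqrt


def standard_error(p, n):
    """Standard error of a proportion p based on a sample of size n."""
    return sqrt(p * (1 - p) / n)


def confidence_interval(p, n, deviate=1.96):
    """Approximate interval p -/+ deviate * S.E.; 1.96 gives 95 per cent."""
    half_width = deviate * standard_error(p, n)
    return p - half_width, p + half_width


def rate_standard_error(rate, base, n):
    """Standard error of a rate R quoted per `base`, where p = R / base."""
    return base * standard_error(rate / base, n)


se = standard_error(0.514768, 5_017_632)
lower, upper = confidence_interval(0.514768, 5_017_632)
se_rate_men = rate_standard_error(6.85, 100, 221)

print(f"S.E. of p = {se:.6f}")
print(f"95 per cent interval: {lower:.4f} to {upper:.4f}")
print(f"S.E. of rate per 100, n = 221: {se_rate_men:.4f}")
```

The normal deviate is left as a parameter so that other confidence intervals may be formed in the analogous manner mentioned in the text.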

CONCORDANCE OF p WITH π

It has been noted that on the average p is identical with the param-
eter π. In individual samples, however, p must be expected to vary
about π in a binomial distribution with standard deviation given by
equation (2). One may therefore readily test whether a value of p given
by a sample is consistent with any assumed value for π. If n is small, the
binomial expansion may be used to find the probability that the deviation
p − π would arise solely through errors of random sampling. When n
is large the normal curve probabilities may be used as providing a good
approximation.

One may turn again to a specific problem for illustrative development
of the principle. Diphtheria deaths reported for the state of Minnesota
in 1935 show the following distribution by sex: male, 5; female, 11. It
being assumed that the cases of diphtheria were equally divided between
the sexes, what is the probability that a distribution of deaths between
the sexes as uneven as this is due to chance?

² Arthur Geissler. Beiträge zur Frage des Geschlechtsverhältnisses der Geborenen.
Zeitschr. K. Sächs. Statistischen Bureaus, 35: 1-24, 1889.

The small number of cases involved in this problem makes direct
expansion of the binomial quite feasible. In the absence of differential
mortality rates between the sexes, the expected proportion of male
deaths, among all deaths, would be defined by π = 0.5. Expansion of
the binomial [π + (1 − π)]^16 will therefore yield the probabilities of each
possible numerical combination of male and female deaths through
random sampling. It is necessary now to examine carefully the question
propounded, so that the appropriate probability may be derived from
this binomial. Fewer than 5 male deaths out of 16 represents a greater
departure from equality than is given by 5 alone, and therefore all com-
binations from zero to 5 male deaths show at least as great a departure as
that specified. Also, the combinations involving 5 or fewer female deaths
represent a departure from equality as great as that specified. The
question fundamentally centers about the deviation from equality, not
merely the ratio of 5 males to 11 females. Thus the sum of the prob-
abilities given by the combinations involving 5 or fewer male or female
deaths is required. Since the binomial is symmetrical, only the first six
terms need be evaluated; double their sum will give the requisite
probability.

TABLE 32

Probabilities of 5 or Fewer Male Deaths in Groups of 16 Persons, Where
π = 0.5

   Number of deaths          Group
  Males      Females         probability
    0          16                1 × 0.5^16
    1          15               16 × 0.5^16
    2          14              120 × 0.5^16
    3          13              560 × 0.5^16
    4          12            1,820 × 0.5^16
    5          11            4,368 × 0.5^16
  Total                      6,885 × 0.5^16

Evaluation of the probabilities corresponding to the specified six
terms of the binomial with π = 0.5 is given in Table 32. Since π is equal
to 1 − π, then π^r (1 − π)^(n−r) is equal in all terms to 0.5^16. The nu-
merical coefficients of the successive terms may be written in directly,
following the simple rule given earlier.³ The total probability derived
from the specified terms then becomes

$$P = 2\,[6{,}885\,(0.5)^{16}] = 0.21.$$

Thus slightly more than 20 per cent of the time, with samples of 16, a
distribution of deaths between the sexes as uneven as that observed
would arise solely through random sampling errors when the expected
division is equal.
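
The direct expansion used for the diphtheria problem is easily mechanized. The sketch below is an illustration only: the function name is hypothetical, and it covers solely the symmetrical case π = 0.5 treated above, summing the binomial coefficients for the smaller tail and doubling.

```python
from math import comb


def two_sided_binomial_p(count, n):
    """Probability of a split at least as uneven as count : n - count when pi = 0.5.

    Only the symmetrical case is handled; the smaller tail is summed and doubled.
    The result is capped at 1 for the case of an exactly even split, where
    doubling would count the middle term twice.
    """
    k = min(count, n - count)
    tail_coefficients = sum(comb(n, j) for j in range(k + 1))
    return min(1.0, 2 * tail_coefficients / 2**n)


coefficients = [comb(16, j) for j in range(6)]
p_value = two_sided_binomial_p(5, 16)

print(coefficients, sum(coefficients))
print(f"P = {p_value:.4f}")
```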

Another problem of like nature may be considered. A report on an 
outbreak of typhoid fever in Mankato, Minnesota,⁴ in 1908, records 35 
deaths from the disease among 511 cases. What is the probability that a 
departure as great as this from the general expectation of a 10 per cent 
mortality rate is due to chance? 

The answer to this problem would be given precisely by expanding 
the binomial [π + (1 − π)]^n, where π = 0.1 and n = 511, the terms 
being summed from zero deaths up to and including 35 deaths. Such 
computation would be onerous indeed. The product of n and π in this 
problem is far in excess of 20, a value suggested in Chapter 12 as indicat- 
ing that very close agreement exists between the normal curve and the 
true binomial distribution, higher values giving even closer agreement. 
Normal curve areas may therefore be used in this problem with confidence 
as providing very close approximations to the true binomial frequencies. 

The continuous nature of the approximating curve contrasts here
with the discrete character of the binomial distribution it is to replace.
This difference should properly be adjusted for by recognizing that the
total probability corresponding to 35 and fewer deaths in the binomial is
equivalent to the relative area under the curve up to 35.5 deaths. Under
the hypothesis being used, namely, that π is 0.1, 51.1 deaths would be
expected. Therefore the normal curve, scaled in terms of number of
deaths, is centered at 51.1 and has a standard deviation √[nπ(1 − π)]
equal to 6.7816. The relative area below 35.5 deaths may be found in
Appendix I after calculation of the relative deviate,

$$\frac{35.5 - 51.1}{6.7816} = -2.30.$$

One finds P = 0.0214. Only once in 47 experiences with sets of 511 cases
each would deviation from expectancy as great as that observed be
anticipated through sampling errors alone. Possibly in this epidemic the
typhoid bacillus was somewhat attenuated, or not all the reported cases
were of typhoid fever but some were milder ailments of somewhat similar
symptoms. In any event, the validity of the hypothesis that the general
mortality rate for the ailments dealt with was 10 per cent is open to grave
suspicion.

³ Vide page 172, also page 250.

⁴ Journal of Infectious Diseases, 9: 410-474. 1911.

Precisely the same conclusions must follow if the numerical expres-
sions take the form of proportions instead of frequencies of deaths; 35.5
deaths out of 511 is a proportion (p) of 0.0695, the mean and standard
deviation of the sampling curve now being defined by equations (1) and
(2) above as

$$\mu_p = 0.1,$$

and

$$\sigma_p = 0.01327.$$

Then

$$\frac{p - \mu_p}{\sigma_p} = \frac{-0.0305}{0.01327} = -2.30,$$

the value previously secured. The same value would follow if case
fatality rates in percentage are used, the changes throughout being linear
transformations of scale and having no influence on the pure number
relative deviate finally secured for entering the table of the normal curve.
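
The Mankato calculation may be sketched as follows. This is an illustration under stated assumptions: the quoted P = 0.0214 is taken to be the two-tailed normal area, which `math.erfc` reproduces, and the exact lower tail of the binomial, described above as onerous by hand, is added only for comparison. All names are illustrative.

```python
from math import comb, erfc, sqrt


def two_tailed_normal_area(deviate):
    """Area of the normal curve beyond -|deviate| and +|deviate| combined."""
    return erfc(abs(deviate) / sqrt(2))


n, pi, deaths = 511, 0.1, 35

# Frequency scale, with the half-unit adjustment for discreteness.
expected = n * pi
sd_count = sqrt(n * pi * (1 - pi))
deviate_count = (deaths + 0.5 - expected) / sd_count

# Proportion scale: the same deviate after a linear change of scale.
p_adjusted = (deaths + 0.5) / n
sd_p = sqrt(pi * (1 - pi) / n)
deviate_p = (p_adjusted - pi) / sd_p

p_normal = two_tailed_normal_area(deviate_count)

# Exact binomial probability of 35 or fewer deaths (one tail only).
exact_lower_tail = sum(
    comb(n, k) * pi**k * (1 - pi) ** (n - k) for k in range(deaths + 1)
)

print(f"deviate (counts) = {deviate_count:.2f}, (proportions) = {deviate_p:.2f}")
print(f"two-tailed normal P = {p_normal:.4f}")
print(f"exact binomial lower tail = {exact_lower_tail:.4f}")
```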


THE SIGNIFICANCE OF THE DIFFERENCE BETWEEN TWO PROPORTIONS 

The foregoing problems deal with the deviation of an observed pro- 
portion from some specified value which is prescribed by theoretical con- 
siderations or based on a sufficiently wide experience to be accepted 
tentatively as the true proportion. A second class of closely related 
problems deals with the consistency of two proportions derived independ- 
ently from finite samples, the parameters being unknown. Attention 
will now be given to such problems in terms of actual data. 

The distribution between the sexes of both cases and deaths in the 
Mankato typhoid fever epidemic referred to above is given in Table 33. 
What is the probability that the difference in the case fatality rates for 
the two sexes is due to chance? 

Note that this question does not deal with the deviations of the
respective rates from any specified value; it is concerned solely with the
difference between them. One is required to test the null hypothesis, and
any assumptions made must be in accord with that hypothesis. Now
the general sampling errors in rates appropriate to each size of sample
being dealt with are a function of π, the true or parametric proportion.
An estimate of π is therefore demanded, and if the null hypothesis is true
then the best estimate available will be given by pooling the experience
for both sexes. In the “total” array in Table 33 the case fatality rate is
given as 6.85 per 100, making 0.0685 the logical estimate of π to be
employed.

TABLE 33

Cases and Deaths from Typhoid Fever, by Sex
Mankato Epidemic, 1908

  Sex        Cases     Deaths     Case fatality
                                  rate per 100 *
  Male        221        16           7.24
  Female      290        19           6.55
  Total       511        35           6.85

The standard errors of sampling in the observed rates (R)
will then become respectively:

  n                S.E.²_R      S.E._R
  221 (men)        2.8872       1.6992
  290 (women)      2.2003       1.4833

The values of nπ do not fall far below 20, and normal curves may be
accepted as good enough first approximations to the actual sampling
distributions of the rates involved. The observed rates for men and
women, of 7.24 and 6.55 respectively, may tentatively be presumed to
belong in normal curves of mean 6.85 and standard errors as given above.

* Two decimal places are carried here solely for numerical accuracy in the derived computations.

In establishing the sampling distribution of differences between
means paired at random,⁵ the solution of all random sampling difference
problems originating in normally distributed statistics was established.
It was there shown that, under the null hypothesis,

$$\mu_d = 0,$$

and

$$\sigma_d = \sqrt{\sigma_x^2 + \sigma_y^2},$$

where d is the difference between the two statistics x and y. By trans-
ferring to the rates (R_m and R_f) and replacing true standard deviations
by standard errors, it follows that

$$\mathrm{S.E.}_d = \sqrt{\mathrm{S.E.}_{R_m}^2 + \mathrm{S.E.}_{R_f}^2} = \sqrt{2.8872 + 2.2003} = 2.26.$$

Now d = 7.24 − 6.55 = 0.69, and the relative deviate of this difference
in its normal sampling distribution is given by

$$\frac{d}{\mathrm{S.E.}_d} = \frac{0.69}{2.26} = 0.31.$$

From the tables in Appendix I, P = 0.76, indicating an entirely
insignificant difference between the case fatality rates for the two sexes.
That such a conclusion would be in order was obvious from the fact that
the standard error of each rate was considerably greater than the actual
difference between the rates.

⁵ Vide Chapter 10, page 138 et seq.

TABLE 34

Test of the Null Hypothesis Applied to Age Corrected Total Death Rates,
Maine and Montana, 1930

Data derived from Table 27

                                Maine       Montana        Total
Population                    798,140       537,330     1,335,470
Expected deaths *               9,975         6,479        16,454
Death rate, R, per 1,000        12.50         12.06         12.32
S.E.²_R †                      0.0152        0.0226
Difference in rates, d           0.44
S.E._d                          0.194
k                                2.27
P                              0.0232

* Reduced to actual population bases.
† Based on π = 0.01232.

A comparatively small difference between rates based on large num-
bers of cases may, of course, be quite significant. By way of illustration
one may take the age corrected total death rates derived from the Maine
and Montana experiences in 1930. Table 34 gives the data and calcula-
tions pertinent to testing the null hypothesis in that investigation. It
should be noted carefully that the population bases used in testing the
significance of the difference between corrected rates must logically be no
larger than those from which the basic age specific rates are derived.
Therefore in Table 34 the “expected deaths” are reduced to the numbers
appropriate to the recorded populations of Maine and Montana.

The probability of the difference between the age corrected total
death rates in Table 34 being due to chance alone proves to be very small.
One might well consider this significant and perhaps continue the study
by correcting for sex, “urbanization,” et cetera, as well as age, to deter-
mine whether the finer group specific forces of mortality are not more
nearly identical. Since the principles of procedure have been set forth
above, the analysis will not be continued herein. It may be under-
taken by the interested reader in order to test his grasp of that which
has been presented.
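
Both comparisons of this section follow the same steps: pool the two experiences to estimate π, form the standard error of each rate, combine them for the difference, and refer the relative deviate to the normal curve. A minimal sketch is given here; the function name and its return layout are assumptions for illustration, and the two-tailed area is obtained from `math.erfc` rather than from Appendix I. Small departures from the printed figures arise because the text rounds intermediate values.

```python
from math import erfc, sqrt


def two_rate_test(events_a, n_a, events_b, n_b, base=1.0):
    """Test the null hypothesis for two rates quoted per `base`.

    Returns (difference, S.E. of difference, relative deviate, two-tailed P).
    """
    pooled = (events_a + events_b) / (n_a + n_b)      # estimate of pi
    unit_variance = pooled * (1 - pooled)
    se_d = base * sqrt(unit_variance / n_a + unit_variance / n_b)
    d = base * (events_a / n_a - events_b / n_b)
    deviate = d / se_d
    return d, se_d, deviate, erfc(abs(deviate) / sqrt(2))


# Mankato case fatality rates per 100, men against women (Table 33).
mankato = two_rate_test(16, 221, 19, 290, base=100)

# Age corrected death rates per 1,000, Maine against Montana (Table 34).
maine_montana = two_rate_test(9_975, 798_140, 6_479, 537_330, base=1_000)

for label, result in (("Mankato", mankato), ("Maine and Montana", maine_montana)):
    d, se_d, deviate, p = result
    print(f"{label}: d = {d:.2f}, S.E. = {se_d:.3f}, deviate = {deviate:.2f}, P = {p:.4f}")
```

The same function serves for proportions (base 1), percentages, or rates per 1,000, the change of base being a linear transformation that leaves the relative deviate untouched.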

The foregoing study of sampling errors in proportions provides widely 
useful techniques for testing the concordance of proportions considered 
by pairs. It has been shown that the analyses may proceed in terms of 
proportions, rates, or actual frequencies, it being immaterial which form 
is adopted. Attention will now be given to the broader problem of the 
concordance of an array of pairs of proportions. The analysis will 
proceed for convenience entirely in terms of frequency. 



CHAPTER 15

THE MEASUREMENT OF FREQUENCY DISCORDANCE 

One of the great intellectual contributions of Gregor Mendel to 
biological science was that he foresaw and demonstrated the possibility of 
stating the results of biological experimentation in simple mathematical 
expressions. Realizing that forces of chance incidence obscured agree- 
ment between the results of his plant hybridization experiments and those 
he theoretically anticipated, he labored diligently at repetition of his 
work to minimize the effect of chance through having large numbers. 
Finally he was convinced and felt that others would be convinced of the 
adequacy of his theory on the basis of his great and precious store of 
evidence. Yet Mendel did not ultimately secure perfect agreement be- 
tween the observed frequencies of occurrence of his phenotypes and those 
which he theoretically expected. The actual differences were relatively 
small. He attributed them to what the statistician would call “errors of 
sampling” and rested his hypothesis on the ground that the observational 
facts approached the theoretical conception more and more closely as the 
number of observations was increased. 

Mendel lacked an objective criterion of the goodness of agreement 
between observed and theoretically expected frequencies. It was not 
until 1900, 34 years after Mendel published his work and in the year when 
it was rediscovered, that Karl Pearson provided a mathematical criterion 
to meet Mendel’s need. Pearson was not concerned specifically with 
Mendel’s problem, for he undoubtedly did not know of it when his 
memoir was written. He was considering such problems in general, and 
the discussion immediately following will be given with the same general 
outlook, illustrated with specific examples. 

Considerable attention has already been directed to the fact that, if
the probability π of an event is known, and if such events and their
alternatives may occur independently in groups of size n, then the prob-
abilities of the possible group combinations are given by the numerical
values of the separate terms in the expansion of the binomial [π + (1 − π)]^n.
Such an expansion, however, yields ideal values to which actual observa-
tions can approximate only within the errors of random sampling. If two
“perfect” coins be randomly tossed 4 times, the results will not necessarily
be 2 heads, head and tail, and 2 tails, in the ratio of 1:2:1. Each toss will
yield one of these combinations, but the results of the second, third, and
fourth tosses will be quite independent of what happened previously. In
the long run, however, as the number of repetitions of the experiment is
increased, closer and closer approximation to the theoretical ratio of
1:2:1 will be reached. Lack of exact agreement between the observed
and theoretical frequencies must be expected purely through chance, and
such discrepancies will not in any sense invalidate the theoretical fre-
quencies.

Admitting that exact agreement between observational and theo- 
retical frequencies should not be expected in any finite experience, one is 
faced with the difficulty of distinguishing between agreement which is 
satisfactory and that which is unsatisfactory. Any one person's judg- 
ment will probably not agree with that of all others. The theoretical 
expression may indeed fail to portray the true distribution law. This 
brings one face to face with a problem of the utmost importance. One 
must have an objective criterion of the closeness of agreement of the 
observed with the theoretical frequencies. Without such a test the 
critical worker may well remain hopelessly confused as to whether the 
observations fully substantiate the theoretical distribution of frequency 
or not. 

Consider the general case of two series of frequencies distributed into
identical classes, the one series arising from observation and the other
from theoretical deduction. Within each class let o designate the ob-
served frequency and c the calculated frequency, the total frequency for
each series being N. Then o − c will measure the absolute discrepancy
in each frequency class. Let n′ designate the number of such classes.
Then the simplest possible summary of the discrepancies will be secured
by adding the values of o − c. But

$$\sum^{n'} (o - c) = \sum^{n'} o - \sum^{n'} c = N - N = 0.$$

This summation obviously defeats the purpose; on the whole there will
always be no discrepancy by this method.

The difficulty again is a matter of signs. It may be surmounted as
before by considering the second moments instead of the first. The total
discrepancy may be measured as Σ(o − c)², a positive quantity of mag-
nitude E, say. Now E will vary in the aggregate with the magnitude of
the individual differences, and hence E would appear to fulfill the require-
ment of measuring the discrepancy between observation and theory.
However, it suffers one severe limitation. The importance of any given
absolute discrepancy depends on the number expected. If theory should
call for a frequency of 500 in a certain group and 502 cases are observed,
the concordance is relatively very much greater than if 5 cases were
expected and 7 were observed. Since the discrepancies in the classes
must finally be summed, the measures of discrepancy should be in com-
parable terms. The frequency discordance should then be measured on a
relative basis, that is,

$$\frac{(o - c)^2}{c} \quad \text{or} \quad \frac{(o - c)^2}{o}.$$

In choosing between these two forms of measuring the relative dis- 
crepancy in each class, two considerations are of predominant impor- 
tance. First, when c is small in magnitude o may conceivably often be 
zero solely through errors of sampling. In such cases the relative dis- 

crepancy would be indeterminate if — is used. To any suggestion 


that c may be zero and o a greater magnitude, thus making 


(o - c)2 


indeterminate, the answer must be clear. If theory does not provide for a 
frequency that does occur, then the theory is obviously inadequate and it 
is foolish to attempt to measure concordance. Quite apart from this issue 
of determinability of the measure itself, there is the second consideration 
that the calculated frequency c represents a wider experience or broader 
conception than the frequency o shown by the finite sample ; c should then 
form the logical reference value, 

(o — c)^ 

The value is then proposed as a measure, comparable from 

c 

one situation to another, of the importance of discrepancies between 
theoretical and observed frequencies. One may with benefit consider 
this measure of frequency discordance in relation to a numerical example. 
If 6 coins were tossed at random 64 times, the number of “heads" being 
recorded for each toss, a frequency distribution could be prepared for the 
observed numbers of heads per toss. Now 6 coins, each with a probability 
of 0.5 of showing the “head" face after random tossing, must show an 
average distribution of heads in 64 tosses as given in the c column of 
Table 35. The column o in that table gives frequencies empirically 
selected as quite typical of the results of an actual tossing experiment. 

The differences o - c vary from zero to 4, the derived elements $(o-c)^2/c$
varying from zero to a maximum of 1. The magnitudes of the latter
elements in relation to the importance of the deviations o — c comprise 
the matter of immediate interest. It will be profitable to study this 
relationship with care. 

TABLE 35

Illustrating the Comparability of the Elements $(o-c)^2/c$ as Measures of
Frequency Discordance

Heads           Frequencies          Discrepancies
per toss    Calculated   Observed
                c            o       o - c    (o-c)^2/c

   6            1            0        -1        1.0
   5            6            8        +2        0.67
   4           15           12        -3        0.6
   3           20           24        +4        0.8
   2           15           14        -1        0.07
   1            6            5        -1        0.17
   0            1            1         0        0.0

Totals         64           64         0        3.31


The following two points may well be noted from the data of Table 35. 


(1) The three unit deviations (for 6, 2, and 1 “head”) yield $(o-c)^2/c$
values inversely proportional to the expected frequency. That is, as the
expected frequency goes up, the importance of any fixed deviation from it
of observed frequency goes down. This was deliberately chosen as a
requirement of the measure of frequency discordance.

(2) When the deviation o - c is proportional to the expected frequency,
as in the “4 heads” and “3 heads” groups, the values of $(o-c)^2/c$ are
directly proportional to the expected frequencies. This most important
quality of the measure of frequency discordance arises through squaring
the deviation but retaining the denominator c in the first power. Just as
a deviation of 2 from an expected 500 is of far less consequence than a
deviation of 2 from 5, so a deviation of 200 from 500 is of far more
consequence than a deviation of 2 from 5.


The measure suggested by Pearson therefore fulfills very simply and
neatly the two outstanding logical requirements for a widely comparable 
measure of discordance between an observed frequency and its theo- 
retical equivalent. 
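
The arithmetic of Table 35 can be checked with a few lines of Python. What follows is only a sketch: the variable names are illustrative, the calculated frequencies are taken to be the binomial expectation for 6 fair coins, and the observed column is copied from the table.

```python
from math import comb

# Calculated frequencies c for the number of heads shown by 6 fair coins
# in 64 tosses: 64 * C(6, k) / 2**6.
coins, tosses = 6, 64
heads = list(range(coins, -1, -1))          # 6, 5, ..., 0, the order of Table 35
c = [tosses * comb(coins, k) / 2**coins for k in heads]

# Observed frequencies o, copied from Table 35 in the same order.
o = [0, 8, 12, 24, 14, 5, 1]

# Element of discordance for each class: (o - c)^2 / c.
elements = [(oi - ci) ** 2 / ci for oi, ci in zip(o, c)]

for k, ci, oi, e in zip(heads, c, o, elements):
    print(f"{k} heads: c={ci:3.0f}  o={oi:2d}  o-c={oi - ci:+3.0f}  element={e:.2f}")

total = sum(elements)
print(f"sum of elements = {total:.2f}")
```

Summed without intermediate rounding the elements come to 3.30. The 3.31 shown in the table is the total of the entries after each has been rounded to two places.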

Establishment of an elemental measure of the lack of agreement between
the frequencies pertinent to each class permits consideration of the
ultimate problem, that of measuring the discordance in all classes as a
whole. One wishes to know the probability that the deviations of observed
frequencies from the theoretical in the distribution as a whole are
attributable to random sampling effects alone. The elements $(o-c)^2/c$,
each being proportional to the general importance of the deviations, may
well be totaled for the entire set of classes. That sum forms the criterion
designated by Pearson as $\chi^2$ (called “chi squared”):

$$\chi^2 = \sum \frac{(o-c)^2}{c} \qquad (1)$$

In any practical problem of measuring total frequency discordance,
such as that presented in Table 35, for instance, $\chi^2$ is a numerical
quantity without dimension of any sort. It is a pure number derived from
pure number frequencies. The scale of $\chi^2$ is obviously limited at zero
as far as smallness is concerned, but may assume very large values when
there is great discrepancy between the observed and theoretical
frequencies. It is the smallness, numerically speaking, of $\chi^2$ which
reflects the degree of concordance of observed and calculated frequencies.
There is some limit above which values of $\chi^2$ will indicate lack of
satisfactory agreement between observation and theory. This limit, however,
will not be any fixed number applicable to all situations, for $\chi^2$ is a
sum of basic elements and its magnitude must inevitably be a function of
the number of those elements.

One may choose to calculate a quantity such as $\chi^2$ with the sole
objective in view of determining the probability that the total discrepancy
it measures might be due to errors of random sampling. It is necessary
then to ascertain the random sampling distribution of $\chi^2$ derived from
n' deviations so that one may know how large its value may become solely
through the chance errors of sampling. This rather complex problem was
solved by Karl Pearson as part of his classical memoir¹ establishing the
criterion. Tables of the probability integral of $\chi^2$ for values of n'
from 3 to 30 were then prepared by Elderton,² from which it is possible to
determine very easily the probability that a given value of $\chi^2$ will be
exceeded through random sampling errors alone.

¹ Karl Pearson. On the criterion that a given system of deviations from the
probable in the case of a correlated system of variables is such that it can
be reasonably supposed to have arisen from random sampling. Phil. Mag.,
50: 157-175. 1900.

² W. Palin Elderton. Tables for testing goodness of fit. Biometrika, 1:
155-163. 1902. See also Table XII in Part I of Tables for Statisticians and
Biometricians, Cambridge University Press, 1914.

In his formulation of the sampling distribution of $\chi^2$, Pearson
recognized that, to a limited extent, the calculated frequencies are forced
to agree with the observed frequencies by the necessary requirement that
$\sum c$ must equal $\sum o$ for comparability. The reader may readily grasp
this point by returning to Table 35, empirically altering therein the values
in the c column. He will find that, of the 7 values, he may change only 6
independently at will, the seventh always being determined by the
requirement that $\sum c$ must equal 64 if the series of inscribed
frequencies is to be comparable with the observed one. That is, only 6 of
the 7 deviations o - c are independent, for the seventh must be adjusted so
that the total, $\sum (o-c)$, equals zero. Of the elements entering the sum
called $\chi^2$, only n' - 1 are then truly independent. Pearson's equation
and Elderton’s derived tables take account of this. It was not until nearly
a quarter of a century later that Fisher³ demonstrated that recognition of
other restrictions limiting the number of independent elements should
properly and may readily be made. Fisher pointed out that each independent
statistic of the observed set of frequencies which is used in formulation
of the calculated set imposes one restriction on the freedom of the c
values to deviate from the o values. The number of independent deviations,
n, in the $\chi^2$ sum is the total number of elements, n', minus this
number of restrictions, r.

$$n = n' - r \qquad (2)$$

The logic of this concept is unquestionably sound, although much 
acrimonious debate springing from conflicting definitions followed its 
introduction. Let the reader consider for a moment the distribution of
sex at birth. If we assume that the probability of male sex is 0.5, then
for Geissler’s series we have:


                   Frequencies
Sex          Observed     Calculated       o - c      (o-c)^2/c
                 o             c

Males        2,582,914     2,508,816       74,098       2,188.5
Females      2,434,718     2,508,816      -74,098       2,188.5

Totals       5,017,632     5,017,632            0    $\chi^2$ = 4,377.0


³ R. A. Fisher. On the interpretation of $\chi^2$ from contingency tables, and the
calculation of P. J. Royal Stat. Soc., 85: 87-94. 1922.

There is marked discordance between observation and theory, and
$\sum (o-c)^2/c$ is not zero. However, the two elements in the sum are not
independent. The o - c values are numerically identical and opposite in
sign, and this must of necessity be true. There is only one independent
deviation, and the magnitude of $\chi^2$ must be interpreted on this basis.
By referring to appropriate tables, one finds that the probability of the
discordance being due to chance is infinitesimal. What then is the true
probability of male sex at birth? If one now uses p = 0.514768 derived
from the observed data, then obviously c will agree with o in each case
and there will be no discordance; $\chi^2$ equals zero. Does this prove that
the value p = 0.514768 is correct? Not at all! Theory has been forced
to agree with observation here in two respects, $N_o = N_c$ and
$p_o = p_c$, and any independent deviation is no longer possible.
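
The two treatments of the sex-ratio data can be set side by side in a short Python sketch. The function and variable names are illustrative only. The observed counts are those of the Geissler series, with males equal to the calculated 2,508,816 plus the deviation 74,098.

```python
def chi_square(observed, calculated):
    """Sum of (o - c)^2 / c over all classes, as in equation (1)."""
    return sum((o - c) ** 2 / c for o, c in zip(observed, calculated))

observed = [2_582_914, 2_434_718]          # males, females
N = sum(observed)

# Case 1: p = 0.5 fixed by hypothesis; only N is taken from the data.
p = 0.5
calc = [N * p, N * (1 - p)]
chi2_fixed = chi_square(observed, calc)
n_fixed = len(observed) - 1                # n = n' - r with r = 1

# Case 2: p estimated from the same data; N and p are both forced to agree.
p_hat = observed[0] / N
calc_hat = [N * p_hat, N * (1 - p_hat)]
chi2_fitted = chi_square(observed, calc_hat)
n_fitted = len(observed) - 2               # r = 2: no independent deviation left

print(f"p = 0.5      : chi2 = {chi2_fixed:,.1f}, independent deviations = {n_fixed}")
print(f"p = {p_hat:.6f}: chi2 = {chi2_fitted:.1f}, independent deviations = {n_fitted}")
```

In the second case the sum vanishes, apart from floating-point rounding, because the calculated frequencies have been built from the observed ones. With no independent deviation remaining, the zero carries no evidence about p.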

The principle illustrated above in simple terms may be extended to all
situations. The number of independent deviations contributing to the
$\chi^2$ sum is crucial to determination of the probability that the lack of
complete concordance between the observed and theoretical frequencies
might be due solely to errors of random sampling. This number of
“independent deviations” is also designated as the number of “degrees of
freedom,” an expression that was current in the literature of statistical
mechanics several decades ago.⁴ Fisher’s adoption of the “degrees of
freedom” terminology has resulted in its wide dissemination in the
statistical literature of the last decade. This writer suggests that the
alternative term “independent deviations” is likely to be more
comprehensible to those whose familiarity with mathematical theory is not
very extensive. It will therefore be preferred in this discussion.

The sampling distribution of $\chi^2$ with n independent deviations is
defined by the equation

$$y = C\,(\chi^2)^{\frac{n-2}{2}}\, e^{-\frac{\chi^2}{2}} \qquad (3)$$

where C is a constant determined solely by n. This equation, established
by Fisher,³ is identical with the one derived earlier by Pearson,¹ subject
to the condition that Pearson’s n' be replaced by n + 1. Thus Elderton’s
tables² of the probability integral of $\chi^2$ still hold, provided that
n' therein be read as n + 1 rather than as the number of frequency classes.
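
For readers who wish to reproduce tabled probabilities, the curve $y = C(\chi^2)^{(n-2)/2} e^{-\chi^2/2}$ and the area in its upper tail can be sketched in Python as below. The text says only that C depends on n, so the value of C used here, the usual one that makes the total area unity, is an assumption. The function names are illustrative, and the tail area is obtained from the standard finite series for whole-number n, not from Elderton's tables.

```python
from math import erfc, exp, gamma, pi, sqrt

def chi2_density(x, n):
    """Ordinate y = C * x**((n - 2) / 2) * exp(-x / 2), with x standing for chi-squared."""
    C = 1.0 / (2 ** (n / 2) * gamma(n / 2))   # assumed: makes the total area equal 1
    return C * x ** ((n - 2) / 2) * exp(-x / 2)

def chi2_tail(x, n):
    """Probability that chi-squared with n independent deviations exceeds x."""
    if n % 2 == 0:
        # Even n: exp(-x/2) * sum_{k=0}^{n/2-1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, n // 2):
            term *= (x / 2) / k
            total += term
        return exp(-x / 2) * total
    # Odd n: a normal-curve tail plus a finite series in x.
    term, total = 1.0, 0.0
    for r in range(1, (n - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2) * total

# Six independent deviations: about 5 per cent of the area lies beyond 12.592
# and about 95 per cent lies beyond 1.635.
print(round(chi2_tail(12.592, 6), 3), round(chi2_tail(1.635, 6), 3))

# The ordinate is greatest at n - 2 = 4 when n = 6.
print(chi2_density(3.9, 6) < chi2_density(4.0, 6) > chi2_density(4.1, 6))
```

The same `chi2_tail` function, entered with n and not with Pearson's n', plays the part of the probability table.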

⁴ It is of interest to note that the underlying principle of adjusting for imposed
restrictions was recognized in other connections at least as early as 1823. Vide
C. F. Gauss. Theoria Combinationis Observationum Erroribus Minimis Obnoxiae,
Pars posterior, Art. 38. Göttingen. 1823. (The author is indebted to Dr. W.
Edwards Deming for this reference.)

The two shaded tails of the curve in Fig. 44 indicate by their truncating
ordinates the values of $\chi^2$ below and above which respectively only 5
per cent of values due to random sampling may be expected to occur
when there are six independent deviations. Thus, values of $\chi^2$ below
1.635 (for six independent deviations) are just as unlikely as values
above 12.592. One is normally concerned solely with the latter type of
value, however. It indicates a point above which values of $\chi^2$ have so
low a likelihood of occurrence solely through errors of sampling as to cause
one to believe that, when they arise in practical work, the theoretical and
observed sets of frequency really are not in good enough agreement to
substantiate the theory. Elderton’s table² gives to seven places of
decimals the relative area in the tail beyond (greater than) successive
values of $\chi^2$ from zero to 70. We reproduce in Appendix IV to this
volume a condensed table of $\chi^2$ to three places of decimals only, and in
terms of n, the number of independent deviations. When more detailed
probabilities are desired, Elderton’s table may be consulted, it being
remembered that therein n' corresponds to our n + 1.

FIGURE 44

Random Sampling Distribution of $\chi^2$ with 6 Independent Deviations



The mathematically adept reader will discern from equation (3) that
the distribution of $\chi^2$ is of skew form. Its mean is always at
$\chi^2 = n$, the mode being two units less, that is at $\chi^2 = n - 2$.
The curves for each value of n all start at $\chi^2 = 0$, and extend to
infinite magnitudes. For small values of n the skewness is very marked.
However, as n increases, the skewness gradually diminishes and the normal
curve is approached as the limiting form of the $\chi^2$ distribution. This
feature is not readily demonstrable with a diagram drawn on the $\chi^2$
scale because the visible range becomes very extended as n increases, with
consequent fall in the ordinate values in general. This is demonstrated in
Fig. 45, wherein only four consecutive values of n are employed. The
difficulty may be circumvented by transforming to the scale of $\chi^2/n$.
Figure 46 presents a series of curves so drawn. The transition towards
normality as n increases may be appreciated more easily from study of this
diagram. In practical work, however, n is rarely large enough for the
normal curve to provide an adequate approximation to the random sampling
distribution of $\chi^2$.

FIGURE 45

Random Sampling Distributions of $\chi^2$ for Several Small Values of n

FIGURE 46

Random Sampling Distributions of $\chi^2/n$ for Several Values of n

These curves are reproduced from drawings very kindly supplied by
Dr. W. Edwards Deming.

Figures 45 and 46 may, nevertheless, be studied advantageously.
They serve to illustrate the most important point that the probability of
any particular value of $\chi^2$ being exceeded is highly dependent on the
number of independent deviations embodied in its magnitude.


ILLUSTRATIONS 

The practical application of the $\chi^2$ test is very simple once the
theoretical frequencies are evaluated. For the first two of the following
illustrations, the theoretical frequencies are secured from the binomial
expansion. In the third and fourth examples the theoretical frequencies
are defined by continuous frequency curves. The reader will find in
Table 24 another set of data to which he may apply the $\chi^2$ criterion as
a test of his command of the reasoning involved.

Example 1. An illustration in Chapter 12 of the use of the binomial expansion
dealt with an experimental situation in which seeds of Pima cotton were planted 6 to
a hill in 1,120 hills. Two weeks later a detailed count revealed that the surviving
crop consisted of 3,220 seedlings, distributed in groups as given under o in Table 36.
The adjoining calculated values c in that table are those derived by the binomial
expansion and are copied directly from Table 21.⁵ The lack of agreement between
the observed frequencies and those given by the binomial expansion may be measured
in terms of $\chi^2$. Is $\chi^2$ sufficiently large to warrant the lack of agreement being
designated as “significant,” and what does “significant” mean here?

The calculations in Table 36 reveal a very large value of $\chi^2$ in this case. Though
there are 7 classes in the sum, it contains but 5 independent elements, for the c
distribution has been forced to agree with the o distribution with respect to two
parameters, N and π. A glance under n = 5 in the table forming Appendix IV to
this volume is sufficient to indicate that, with $\chi^2$ so large, the probability of the
discordance in frequency being due to chance must be essentially infinitesimal.

Study of the o - c values in Table 36 reveals that the seedlings were more abundant
in the highest and lowest survival classes, and correspondingly deficient in the
intervening classes, than the independence requirement of the binomial permits.
Thus they tended to survive or die as groups rather than as individual plants. Such
a situation might arise through physical obstacles like patchy hard soil-crust
preventing emergence of the seedlings as groups, or through the presence of diseases
tending to take all seedlings in the groups they infect. The statistical analysis in
terms of $\chi^2$ merely signifies, of course, that the fates of seedlings within the
groups are far from being independent.

⁵ Vide page 175.


TABLE 36

The $\chi^2$ Criterion Used as a Test of Individual Independence
in Survivorship of Grouped Plants

X = number of surviving seedlings from 6 seeds per hill.

             Frequencies          Discrepancies
   X         o          c         o - c      (o-c)^2/c

   0        262        22.4      +239.6       2562.86
   1         83       123.4      - 40.4         13.23
   2         99       283.8      -184.8        120.33
   3        164       348.2      -184.2         97.44
   4        217       240.3      - 23.3          2.26
   5        191        88.4      +102.6        119.08
   6        104        13.6      + 90.4        600.89

Totals    1,120     1,120.1         0.0    $\chi^2$ = 3516.09

n = 5.   $P_{\chi^2}$ is infinitesimally small.
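
The calculated column of Table 36 and the resulting $\chi^2$ can be regenerated from the observed column alone. The sketch below assumes, as the example states, that both N and the survival probability are taken from the sample. The variable names are illustrative.

```python
from math import comb

# Observed numbers of hills with X = 0..6 surviving seedlings (Table 36).
o = [262, 83, 99, 164, 217, 191, 104]
seeds_per_hill = 6
N = sum(o)                                       # number of hills

# Survival probability estimated from the data themselves.
seedlings = sum(x * f for x, f in enumerate(o))
p = seedlings / (seeds_per_hill * N)

# Binomial expectation for each class.
c = [N * comb(seeds_per_hill, x) * p**x * (1 - p) ** (seeds_per_hill - x)
     for x in range(seeds_per_hill + 1)]

chi2 = sum((oi - ci) ** 2 / ci for oi, ci in zip(o, c))

# Two restrictions (N and p both taken from the sample): n = n' - 2.
n = len(o) - 2

print(f"seedlings = {seedlings}, p = {p:.4f}")
print("c =", [round(ci, 1) for ci in c])
print(f"chi2 = {chi2:.1f} with n = {n}")
```

Because the expected frequencies are carried here to full precision and not to one decimal place, the total differs a little from the 3516.09 of the table. The difference has no bearing on the conclusion.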



Example 2. In an effort to demonstrate the applicability of the binomial theorem
in forecasting the results of experiments, Weldon tossed 12 dice 26,306 times with
thorough randomization between each throw. The appearance of either the 5 or
the 6 face was recorded as a favorable event, π thus being established on a priori
reasoning as one-third. The second and third columns of Table 37 present the
observed and calculated frequency distributions of number of favorable events per
throw. Are the experimental results in accord with theoretical expectancy?

There are 13 possible combinations of favorable and unfavorable events in this
problem. The expected frequencies for 10, 11, and 12 favorable events are,
however, quite small. These classes have been grouped in Table 37 because the random
sampling distribution of $\chi^2$ involves an assumption that the expected frequency shall
always be large enough to give a reasonably normal distribution of its sampling
error. This is undoubtedly achieved when c is 20 or above, but one will not err far
if c is around 10 or more. Grouping the last 3 classes brings c to 14.3, which will
be taken as an acceptable value. n' is reduced to 11 by this grouping, and the one
restriction of equal total frequencies makes n equal to 10. With $\chi^2$ equal to 35.5,
P may be found from Elderton’s table (entered under n' = n + 1 = 11) to be
0.00054. There is therefore only about 1 chance in 2,000 that so large a discrepancy
would arise through errors of random sampling alone.

TABLE 37

Application of the $\chi^2$ Criterion to Weldon’s Dice-Tossing Experiment.
Theoretical Frequencies Based on the Binomial Expansion Using π = 1/3

X = number of favorable results per toss.

                 Frequencies              Discrepancies
   X           o             c            o - c     (o-c)^2/c

   0            185         202.8        - 17.8       1.56
   1          1,149       1,216.5        - 67.5       3.75
   2          3,265       3,345.4        - 80.4       1.93
   3          5,475       5,575.6        -100.6       1.82
   4          6,114       6,272.6        -158.6       4.01
   5          5,194       5,018.0         176.0       6.17
   6          3,067       2,927.2         139.8       6.68
   7          1,331       1,254.5          76.5       4.67
   8            403         392.0          11.0       0.30
   9            105          87.1          17.9       3.68
  10             14 \
  11              4  } 18    14.3           3.7       0.96
  12              0 /

Totals       26,306      26,306.0           0.0   $\chi^2$ = 35.53

n = 10.   $P_{\chi^2}$ = 0.00054.



Lack of independence in such a carefully executed random tossing experiment is
not likely. One is accordingly disposed to explain the discrepancies between o and
c by assuming that the dice were biased. This may be tested by calculating the
probability p of the favorable event from the experimental throws. From Table 37,

$$p = \frac{\sum ox}{12(26{,}306)} = \frac{106{,}602}{315{,}672} = 0.3377,$$

a value decidedly in excess of that of one-third given by the assumption of unbiased
dice. The statistical significance of the deviation, p - π, may be tested by applying
the technique given in the previous chapter. For π = 1/3 and N = 315,672,

$$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{N}} = 0.000839, \qquad \frac{p-\pi}{\sigma_p} = 5.2,$$

for which the corresponding P is approximately 1 in 5,000,000. There is no doubt
then that Weldon's dice must have been biased.
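
The test of bias may be retraced numerically. The sketch below assumes that the technique referred to is the comparison of the deviation p - π with its standard error under the hypothesis, the probability being taken from both tails of the normal curve. The variable names are illustrative.

```python
from math import erfc, sqrt

favorable = 106_602            # total of x * o over Table 37
N = 12 * 26_306                # individual dice thrown
pi_theory = 1 / 3

p = favorable / N
sigma_p = sqrt(pi_theory * (1 - pi_theory) / N)   # standard error of p under the hypothesis
ratio = (p - pi_theory) / sigma_p

# Two-tailed normal probability of a deviation at least this large.
P = erfc(abs(ratio) / sqrt(2))
print(f"p = {p:.4f}, sigma_p = {sigma_p:.6f}, ratio = {ratio:.2f}, P = {P:.1e}")
```

A ratio of about 5.2 corresponds to a two-tailed probability near 2 in 10,000,000, which is the 1 in 5,000,000 quoted in the text.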

The technique of tossing may now be tested for its randomization of throws by
determining a new theoretical set of frequencies having the same probability of the
favorable event as the observed set. Those frequencies, together with the
calculations for $\chi^2$, are given in Table 38. There being two restrictions this time, n is


TABLE 38

The $\chi^2$ Criterion Used as a Test of Individual Independence of Dice in
Weldon’s Experiment

π = 0.3376986 = p observed.

X = number of favorable events per toss.

                 Frequencies              Discrepancies
   X           o             c            o - c     (o-c)^2/c

   0            185         187.4        -  2.4       0.03
   1          1,149       1,146.5           2.5       0.01
   2          3,265       3,215.2          49.8       0.77
   3          5,475       5,464.7          10.3       0.02
   4          6,114       6,269.4        -155.4       3.85
   5          5,194       5,114.7          79.3       1.23
   6          3,067       3,042.5          24.5       0.20
   7          1,331       1,329.7           1.3       0.00
   8            403         423.8        - 20.8       1.02
   9            105          96.0           9.0       0.84
  10             14 \
  11              4  } 18    16.1           1.9       0.22
  12              0 /

Totals       26,306      26,306.0           0.0   $\chi^2$ = 8.19

n = 9.   $P_{\chi^2}$ = 0.52.



equal to 9. For $\chi^2$ equal to 8.19, the probability of the frequency discordance being
due to chance is seen from Appendix IV to be slightly in excess of 0.5, indicating
excellent agreement. Thus Weldon's tossing was good. Probably nearly all dice
are biased, as material removed in engraving the “dots” is seldom replaced by an
equal amount of substance of the same specific gravity.
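
Both analyses of Weldon's throws follow the same routine and differ only in the probability supplied and in the number of restrictions. The following Python sketch makes the parallel explicit. The function name `grouped_chi2` is illustrative, and the pooling of classes 10, 11 and 12 follows the text.

```python
from math import comb

# Observed throws showing X = 0..12 favorable dice (Table 37).
o = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 14, 4, 0]
dice, throws = 12, sum(o)

def grouped_chi2(prob):
    """Chi-squared against the binomial with the given probability,
    classes 10, 11 and 12 being pooled. Returns (chi2, number of classes)."""
    c = [throws * comb(dice, x) * prob**x * (1 - prob) ** (dice - x)
         for x in range(dice + 1)]
    o_g = o[:10] + [sum(o[10:])]
    c_g = c[:10] + [sum(c[10:])]
    return sum((a - b) ** 2 / b for a, b in zip(o_g, c_g)), len(o_g)

# Hypothesis of unbiased dice: only the total is forced to agree.
chi2_fair, classes = grouped_chi2(1 / 3)
n_fair = classes - 1

# Probability taken from the throws themselves: total and p both forced.
p_obs = sum(x * f for x, f in enumerate(o)) / (dice * throws)
chi2_fit, _ = grouped_chi2(p_obs)
n_fit = classes - 2

print(f"pi = 1/3      : chi2 = {chi2_fair:.2f}, n = {n_fair}")
print(f"pi = {p_obs:.7f}: chi2 = {chi2_fit:.2f}, n = {n_fit}")
```

The two sums agree with Tables 37 and 38 to within the rounding of the calculated frequencies.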

Example 3. An experiment to demonstrate the sampling distribution of means
has been reported in Chapter 10. Therein, Fig. 31⁶ shows the theoretical normal
curve superimposed on the histogram for the means of 786 samples of 4 individuals
each randomly drawn from a defined supply. The frequencies within specified
ranges for both histogram and curve are given in Table 39. Is the histogram in
agreement with the theoretically prescribed curve?

⁶ Vide page 132.

Only one of the three parameters defining the normal curve was drawn in this 
case from the observed frequencies. This restriction exists in every test, for the 

TABLE 39

Test of Goodness of Fit of the Theoretical Normal Curve to the Distribution
of Means from 786 Random Samples of 4.

                      Frequencies               Discrepancies
Class range          o             c            o - c    (o-c)^2/c

   →  10.85           7 \          4.3 \
10.85-10.95           9 / 16       7.2 / 11.5     4.5       1.76
10.95-11.05          21           15.9            5.1       1.64
11.05-11.15          17           30.5          -13.5       5.98
11.15-11.25          55           51.4            3.6       0.25
11.25-11.35          75           76.0          - 1.0       0.01
11.35-11.45         101           98.6            2.4       0.06
11.45-11.55         123          112.0           11.0       1.08
11.55-11.65         107          111.6          - 4.6       0.19
11.65-11.75          88           97.5          - 9.5       0.93
11.75-11.85          88           74.7           13.3       2.37
11.85-11.95          41           50.2          - 9.2       1.69
11.95-12.05          30           29.8            0.2       0.00
12.05-12.15          14           15.3          - 1.3       0.11
12.15-12.25           6 \          6.9 \
12.25 →               4 / 10       4.1 / 11.0   - 1.0       0.09

Totals              786          786.0            0.0   $\chi^2$ = 16.16

n = 13.   $P_{\chi^2}$ = 0.30.



total observed and calculated frequencies must agree if the test is to be applied.
The mean and standard deviation of the curve are those theoretically prescribed;
they are not derived from the observed distribution. The derived $\chi^2$ of 16.16
therefore has n = 13 independent elements, and from Appendix IV one finds P = 0.30.
The theory is therefore well substantiated by the outcome of this sampling
experiment.

Example 4. The 500 determinations by Birge⁷ of the width of a spectral band
of light have been graduated by a normal curve in Fig. 1.⁸ The classified frequencies
are given in Table 40. Does the random error of measurement in determining the
width of the light band appear to follow the normal law?

⁷ These data were kindly supplied to the author by Professor Birge.

⁸ Vide page 13.


TABLE 40

Test of Goodness of Fit of the Normal Law to the Distribution of Error
in Measuring Width of a Spectral Band of Light

X = measured width of band in microns.

                    Frequencies            Discrepancies
    X              o           c           o - c    (o-c)^2/c

   →  1705           5 \
1705-1725           12 / 17    18.3        - 1.3       0.09
1725-1745           43         36.1          6.9       1.32
1745-1765           61         70.6        - 9.6       1.31
1765-1785          105        101.8          3.2       0.10
1785-1805          103        108.5        - 5.5       0.28
1805-1825           89         85.3          3.7       0.16
1825-1845           54         49.5          4.5       0.41
1845-1865           19         21.3        - 2.3       0.25
1865-1885            7 \
1885 →               2 /  9     8.6          0.4       0.02

Totals             500        500.0          0.0   $\chi^2$ = 3.94

n = 6.   $P_{\chi^2}$ = 0.68.


The question to be answered here deals solely with the form of a distribution and
is not concerned with the values of the parameters defining the theoretical curve.
One must logically use the curve of “best fit,” that is, the one whose definitive
parameters are derived from properties of the observed data. When fitting a normal
curve by moments, this involves setting up the following identities:

                        Sample Statistic        Curve Parameter

Frequency                     N            =          N
Mean                      $\bar{x}$        =        $\mu$
Standard deviation          $s_x$          =      $\sigma_x$

Thus the curve used is forced to agree with the observed data in 3 respects, leaving
n = n' - 3 independent elements in the test. Since n' is 9 in Table 40, then the
number of independent deviations left in $\chi^2$ is 6. Now the derived $\chi^2$ equals 3.94,
and one may ascertain from Appendix IV that the probability of equal or greater
departure from normality arising solely through chance is 68 in 100. The theory of
normal distribution of error in this case is very well substantiated by the
measurements made.
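
The counting of restrictions and the reading of the probability may be illustrated with the grouped frequencies of Table 40. In this sketch the calculated frequencies 18.3 and 8.6 for the two pooled tail classes are inferred from the printed deviations of -1.3 and 0.4, which is an assumption. The tail-area function is the standard finite series for an even number of independent deviations, not a transcription of Appendix IV.

```python
from math import exp

# Grouped frequencies from Table 40 (9 classes after pooling the tails).
# The values 18.3 and 8.6 are inferred from the printed deviations (assumption).
o = [17, 43, 61, 105, 103, 89, 54, 19, 9]
c = [18.3, 36.1, 70.6, 101.8, 108.5, 85.3, 49.5, 21.3, 8.6]

chi2 = sum((a - b) ** 2 / b for a, b in zip(o, c))

restrictions = 3                      # N, mean and standard deviation all fitted
n = len(o) - restrictions             # n = n' - 3

def tail_even(x, n):
    """P(chi-squared > x) for an even number n of independent deviations."""
    term, total = 1.0, 1.0
    for k in range(1, n // 2):
        term *= (x / 2) / k
        total += term
    return exp(-x / 2) * total

P = tail_even(chi2, n)
print(f"chi2 = {chi2:.2f}, n = {n}, P = {P:.2f}")
```

Had the mean and standard deviation been prescribed by theory, as in Example 3, only one restriction would apply and the same $\chi^2$ would be referred to n = 8.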

The graduation by a normal curve of the observed data on examination
scores forming the histogram of Fig. 26⁹ may be tested as above.
The data and calculations of that test are given in Table 41 because of
the special interest of the resulting probability, $P_{\chi^2} = 0.98$. The
graduation given by the normal curve is most unusually good in this case.
Only once in 50 times would a better fit arise in a random sample. Is it
“too good,” thus placing the hypothesis of a normal distribution of the

TABLE 41

Test of Goodness of Fit of the Normal Law to the Distribution of
Individual Scores on a “Current Affairs” Examination

                     Frequencies             Discrepancies
Score on test       o            c           o - c    (o-c)^2/c

   →  49             2 \         2.9 \
 50- 54              7 /  9      6.8 /  9.7   -0.7       0.05
 55- 59             18          17.3           0.7       0.03
 60- 64             36          35.4           0.6       0.01
 65- 69             59          58.3           0.7       0.01
 70- 74             74          77.3          -3.3       0.14
 75- 79             88          82.9           5.1       0.31
 80- 84             72          71.6           0.4       0.00
 85- 89             44          49.7          -5.7       0.65
 90- 94             30          27.9           2.1       0.16
 95- 99             11 \        12.6 \
100-104              5  } 19     4.6  } 18.9   0.1       0.00
105 →                3 /         1.7 /

Totals             449         449.0           0.0   $\chi^2$ = 1.36

n = 7.   $P_{\chi^2}$ = 0.98.



scores in jeopardy? It is true that this experience deviates from average
expectancy of $P_{\chi^2} = 0.50$ just as much as a probability of 0.02, which
might well be taken to designate a “significant deviation.” Does it also
signify that something is wrong? There are some who would argue so,
but the burden rests with them to suggest on logical grounds wherein the
agreement has been forced. In this specific example there are no grounds
for such argument. It must be remembered that such probabilities do
arise through chance, just as sometimes those below 0.02 do. One may
well “raise the eyebrow” at statistical demonstrations that look too good,
but in so doing the skeptic should not neglect the responsibility so
assumed. The point should not be overlooked, however, that, when
$P_{\chi^2}$ is extremely high, consideration of the possibility that factors
forcing concordance may be operative is logically in order. The test may
justify examination of the possibility, but it does not logically signify
forced concordance in the same sense that a low value of $P_{\chi^2}$ signifies
lack of satisfactory agreement. If lack of satisfactory agreement were
not a priori a logical alternative hypothesis to that of good agreement,
then there would have been no point in testing the goodness of agreement
at all.

⁹ Vide page 14.



CHAPTER 16 

INDEPENDENCE AND BIVARIATE TABLES OF FREQUENCY 

The reader may have noticed that, although the $\chi^2$ criterion was
applied in the preceding chapter solely to testing frequency concordance
in univariate distributions of quantitatively described variables, the
reasoning leading to its elaboration is not so restricted. It may be
applied wherever observed frequencies are to be tested for concordance
with a theoretical set applying to the same classes, the number of
independent deviations being ascertainable. The classes providing the
frequencies need not be quantitatively defined; division of the total
frequency into classes representing types on some scale of description is all
that is necessary. The concordance of theoretical Mendelian ratios with
the outcome of hybridization experiments may be tested with confidence
by means of this criterion, provided that the expected frequencies in
crucial classes are not too small.

Application of the technique to bivariate and multivariate frequency
tables to test the independence from one another of the variables
involved is very simple. It is commonly so employed when the variables
are described through qualitative classification only. The definition
of independence previously given¹ automatically provides the basis
for computation of the necessary theoretical set of frequencies. These
possibilities will now be investigated in terms of actual data.

The mouths of 135 women who claimed to have good health (aside
from dental imperfections) were carefully examined to determine the
periodontal condition.² Each mouth was classified with respect to
degree of involvement with pyorrhea according to the arbitrary scale: A
absent, B slight, C moderately advanced, D advanced and far advanced.
Careful records were then made of the patients’ diets during a so-called
average week, each diet being freely chosen. These diets were later
evaluated for the daily intake of the supposedly important mineral
elements and vitamins. Formation of bivariate frequency distribution
tables for the association of periodontal condition and daily intake of
minerals and vitamins was thus made possible. Table 42 presents the
data for daily calcium intake as it was found associated with periodontal
condition. It is desired to determine whether there is any relationship
between these two variables.

¹ Vide page 170.

² The author is indebted to Dr. Dorothea Radusch of the School of Dentistry,
University of Minnesota, for these data.


TABLE 42

Observed Distribution of Periodontal Condition and Average Daily
Calcium Intake in 135 Women

Calcium per day,             Periodontal condition
in grams                  A       B       C       D      Totals

0.70 →         a         11       6       6       2        25
0.55-0.70      b         10       8       3       1        22
0.40-0.55      c          3       5      11      11        30
   →  0.40     d          5       4      26      23        58

Totals                   29      23      46      37       135


It is rather difficult to sense any correlation between the two variables
in Table 42 through inspection of its frequencies. The necessarily
categorical description of periodontal condition forbids appeal to the 
form of correlation analysis previously discussed. However, one may 
appeal to the definition that, if two events are independent, the prob- 
ability of their joint occurrence is equal to the product of the probabilities 
of each occurring alone. If periodontal condition and calcium intake 
are independent of one another, then the 29 people having no pyorrhea 
(periodontal type A ) should not differ in their calcium intake from those 
of any other type. They should be distributed in the given classes of 
calcium intake in the same proportions as each of the other four types. 
That is, the individuals in each periodontal group should be distributed 
with respect to calcium intake in the same proportions as are given by 
the group as a whole. The expected frequency in each cell, on the 
assumption of independence of the two variables, is determinable directly 
from the marginal totals alone. 
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The calculation of these expected frequencies is very simple. The 29 
individuals of periodontal condition A are to be distributed among the
four calcium intake classes, a to d, according to the respective proportions
$\frac{25}{135}$, $\frac{22}{135}$, $\frac{30}{135}$, and $\frac{58}{135}$. Multiplying each proportion by the
total 29 gives the required number. That is, on the assumption of independence,
the expected frequency in each cell is one-Nth part of the
product of the marginal totals for the two arrays¹ in which the cell lies.
The calculation of $\chi^2$ may then be proceeded with directly, it being convenient
usually to prepare a table large enough to permit of the entries
being made in each cell to which they belong. This is done in Table 43,
wherein $\chi^2$ is found to be 44.6.
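The rule just stated can be sketched in a few lines of Python. This is only an illustration, not the author's worksheet: the function names `expected_frequencies` and `chi_square` are assumptions made for the sketch, and the observed counts are those of Table 42.

```python
# Observed frequencies of Table 42: rows are calcium classes a-d,
# columns are periodontal conditions A-D.
observed = [
    [11, 6, 6, 2],
    [10, 8, 3, 1],
    [3, 5, 11, 11],
    [5, 4, 26, 23],
]


def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum of (o - c)^2 / c over every cell of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - c) ** 2 / c
        for obs_row, exp_row in zip(table, expected)
        for o, c in zip(obs_row, exp_row)
    )


expected = expected_frequencies(observed)
chi2 = chi_square(observed)
print([round(c, 1) for c in expected[0]])
print(round(chi2, 1))
```

Because the sketch carries the expected frequencies at full precision, its total differs by a few tenths from the 44.6 obtained in Table 43, where each entry was rounded to one decimal before use.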

An important problem is to determine the number of independent 
deviations of observation from theory in such tests of concordance. 
Since each array in the table of theoretical frequencies is derived by 
proportioning the observed total for that array, the expected frequency 
in one cell of the array is fixed entirely by what has been assigned to the 
others. If a certain total is to be divided into r parts, then only r - 1 
parts may be written down independently or with complete freedom. 
The value of the remaining part is determined by the restriction that 
the total of the r parts must be a specified value. In the bivariate type 
of table this restriction applies to both the r vertical and the s horizontal
arrays, leaving only (r − 1)(s − 1) cells with unrestricted freedom
to show deviation. The general rule for determining the number n of
independent elements in $\chi^2$ as calculated above for bivariate tables is
therefore very simple. For Table 43, n = (r − 1)(s − 1) = 3 × 3 = 9.

The probability of a $\chi^2$ value of 44.6 with 9 "degrees of freedom" arising
by sampling errors alone may be determined from Elderton's table
to be less than 1 in 100,000. There will be no question then of the 
validity of the conclusion that periodontal condition in these women is 
not independent of the calcium intake in their freely chosen diets. One 
may next ask: What is the nature of the association? This may be 
seen readily by inspection of the differences o — c in Table 43, wherein 
the excess or deficiency of the observed frequency with respect to the 
expected value, assuming independence, is given respectively in bold- 
face and italic type. It will be observed quickly that good periodontal 
condition runs more freely with high calcium, and advanced pyorrhea 
runs more freely with low calcium, than the assumption of independence 
allows for. 
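Where a printed table of the probability integral is not at hand, the tail probability can be computed directly. The sketch below is an assumption-laden illustration rather than the tabular method used in the text: the function name `chi2_upper_tail` is hypothetical, and it relies on the closed-form finite series that exist for whole-number degrees of freedom.

```python
import math


def chi2_upper_tail(x, df):
    """Probability that chi-square with df (a positive integer) exceeds x."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: two-tailed normal area beyond sqrt(x), plus a finite series
    # whose j-th term is sqrt(2x/pi) * exp(-x/2) * x^(j-1) / (1*3*...*(2j-1)).
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return tail + total


p_table_43 = chi2_upper_tail(44.6, 9)
print(p_table_43 < 1e-5)
```

For $\chi^2$ = 44.6 with 9 degrees of freedom the function returns a value of the order of one in a million, in agreement with the statement that the probability is less than 1 in 100,000.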

¹ The term array is used herein for both vertical and horizontal lines of
frequencies.
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The reader may have noticed that the calculated frequencies of Table 
43 are for the most part less than 10. The probability derived from $\chi^2$
may then be somewhat misleading. Coarser grouping may be resorted
to in order to avoid this type of error. The classes A and B, a and b of
Table 43 are united in Table 44, giving a new (3 × 3)-fold classification.
The new c values will, like the o values, be the sum of those in the original
cells which are now fused. The new $\frac{(o - c)^2}{c}$ elements will not,
however, be the sums of those from the united cells. For the new classification,
$\chi^2$ falls to 41.8. Since n is now reduced to 4, P becomes less
than 1 in 1,000,000, a value strengthening the previously formed conclusion.

TABLE 43

The $\chi^2$ Test of Independence Applied to the Data of Table 42

Calcium                         Periodontal condition
class                        A        B        C        D      Totals
a      o                    11        6        6        2        25
       c                   5.4      4.3      8.5      6.9     0.18519
       o - c              +5.6     +1.7     -2.5     -4.9
       (o-c)²/c            5.8      0.7      0.7      3.5

b      o                    10        8        3        1        22
       c                   4.7      3.7      7.5      6.0     0.16296
       o - c              +5.3     +4.3     -4.5     -5.0
       (o-c)²/c            6.0      5.0      2.7      4.2

c      o                     3        5       11       11        30
       c                   6.4      5.1     10.2      8.2     0.22222
       o - c              -3.4     -0.1     +0.8     +2.8
       (o-c)²/c            1.8      0.0      0.1      1.0

d      o                     5        4       26       23        58
       c                  12.5      9.9     19.8     15.9     0.42963
       o - c              -7.5     -5.9     +6.2     +7.1
       (o-c)²/c            4.5      3.5      1.9      3.2

Totals                      29       23       46       37       135

$\chi^2$ = 44.6.   n = 9.   P < 10⁻⁵.

TABLE 44

Calculation of $\chi^2$ from a Coarser Classification of the Data of Table 43

Calcium                     Periodontal condition
class                      A+B        C        D      Totals
a+b    o                    35        9        3        47
       c                  18.1     16.0     12.9
       o - c             +16.9     -7.0     -9.9
       (o-c)²/c           15.8      3.1      7.6

c      o                     8       11       11        30
       c                  11.5     10.2      8.2
       o - c              -3.5     +0.8     +2.8
       (o-c)²/c            1.1      0.1      1.0

d      o                     9       26       23        58
       c                  22.4     19.8     15.9
       o - c             -13.4     +6.2     +7.1
       (o-c)²/c            8.0      1.9      3.2

Totals                      52       46       37       135

$\chi^2$ = 41.8.   n = 4.   P < 10⁻⁶.
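The effect of the coarser grouping can be checked with a short sketch. It is offered only as an illustration: the helper names `merge_first_two` and `chi_square` are assumptions, and the expected frequencies are recomputed at full precision from the merged marginal totals rather than summed from rounded entries.

```python
# Observed frequencies of Table 42 (rows a-d, columns A-D).
observed = [
    [11, 6, 6, 2],
    [10, 8, 3, 1],
    [3, 5, 11, 11],
    [5, 4, 26, 23],
]


def merge_first_two(table):
    """Fuse the first two rows and the first two columns by adding their cells."""
    rows = [[x + y for x, y in zip(table[0], table[1])]]
    rows += [list(r) for r in table[2:]]
    return [[row[0] + row[1]] + row[2:] for row in rows]


def chi_square(table):
    """Sum of (o - c)^2 / c with c = row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (o - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for o, c in zip(row, col_totals)
    )


coarse = merge_first_two(observed)
chi2_coarse = chi_square(coarse)
degrees = (len(coarse) - 1) * (len(coarse[0]) - 1)
print(coarse, round(chi2_coarse, 1), degrees)
```

The merged observed frequencies are simple sums of the fused cells, but the new $\chi^2$ is not the sum of the old elements, which is the point made in the text. The unrounded total lies within a few tenths of the 41.8 quoted.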
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A SHORT-CUT METHOD 

There are somewhat quicker methods of calculating $\chi^2$ for bivariate
frequency tables wherein a test of the independence of the two variables
is to be made. It may be observed that, in general,

$$\chi^2 = \sum\left[\frac{(o - c)^2}{c}\right] = \sum\left[\frac{o^2}{c}\right] - N$$

It has been shown above that, in bivariate tables, c is one-Nth part of
the product of the totals of the two arrays in which each cell lies. Therefore,
if $T_x$ and $T_y$ are those totals, then

$$Nc = T_x T_y$$

and

$$\frac{o^2}{c} = N\,\frac{o^2}{T_x T_y} = N u, \qquad u = \frac{o^2}{T_x T_y} \qquad (1)$$

It is numerically simpler to calculate the values u than the elements
$\frac{(o - c)^2}{c}$ in practical work, and therefore the formula

$$\chi^2 = N\left[\sum u - 1\right] \qquad (2)$$

is often used. The disadvantage of this method lies in the c and o — c 
values for each cell not being determined. One is therefore at a loss to 
know whether the grouping is too fine over any region. Also one cannot 
inspect the arrangement of the signs of o — c in order to determine the 
nature of any association indicated by $\chi^2$ being significant.

For illustrative purposes this method is applied in Table 45, where 
the relationship of periodontal condition to average daily intake of 
vitamin C is considered. Broad grouping of the original data is neces- 
sary to secure sufficiently large theoretical frequencies for this material. 
The values of c, which do not appear in these calculations, had to be 
derived independently for this purpose. $\chi^2$ is very small this time, and
with n equal to 2 the probability of the observed frequencies arising 
through random sampling from an uncorrelated surface is 0.24. There 
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is very good agreement therefore between the observed facts and the 
hypothesis that pyorrhea and vitamin C intake are independent of one 
another for the patients considered. With diets showing greater defi- 
ciency in vitamin C intake than are represented above, there might 
prove to be significant association with periodontal condition.

TABLE 45

Extent of Pyorrhea and the Average Daily Vitamin C Intake in 135 Women

Sherman units of                 Periodontal condition, x
vitamin C per day, y            A+B         C         D      Totals, Ty
60 and above      o              34        23        19          76
                  TxTy         3952      3496      2812
                  u           0.2925    0.1513    0.1284

Less than 60      o              18        23        18          59
                  TxTy         3068      2714      2183
                  u           0.1056    0.1949    0.1484

Totals, Tx                       52        46        37         135

$\chi^2$ = N[Σu − 1] = 2.85.   n = 2.   P = 0.24.
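The short-cut of equation (2) may be sketched as follows. The function names are assumptions made for the illustration; the observed counts are those of Table 45, the first cell (34) being fixed by the marginal totals.

```python
# Observed frequencies of Table 45: rows are vitamin C classes,
# columns are periodontal conditions A+B, C, D.
observed = [
    [34, 23, 19],
    [18, 23, 18],
]


def chi_square_shortcut(table):
    """Equation (2): chi-square = N * (sum of u - 1), with u = o^2 / (Tx * Ty)."""
    row_totals = [sum(row) for row in table]        # Ty
    col_totals = [sum(col) for col in zip(*table)]  # Tx
    n = sum(row_totals)
    sum_u = sum(
        o * o / (tx * ty)
        for row, ty in zip(table, row_totals)
        for o, tx in zip(row, col_totals)
    )
    return n * (sum_u - 1.0)


def chi_square_basic(table):
    """Basic definition: sum of (o - c)^2 / c with c = Tx * Ty / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (o - tx * ty / n) ** 2 / (tx * ty / n)
        for row, ty in zip(table, row_totals)
        for o, tx in zip(row, col_totals)
    )


shortcut = chi_square_shortcut(observed)
basic = chi_square_basic(observed)
print(round(shortcut, 2), round(basic, 2))
```

Both routes give the same figure; the short-cut merely avoids forming c and o − c for each cell, which is also why it hides the signs of the deviations.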


THE FOURFOLD TABLE 

The situation is not at all uncommon, particularly in medical and 
sociological investigations, wherein characters under analysis may be 
divisible into no more than two types. In other inquiries it is a matter 
of expediency to coarsen the grouping in a more finely divisible variable 
so that only two contrasting classes are set up. Situations arise therefore 
in which the relationship between two variable characters is defined in
a bivariate frequency table which has but four cells. Such tables are
commonly known as “fourfold tables.” Sex, marital status, survival,
success, existence, maturity, etc., represent familiar conditions with
respect to which description may be confined to two alternative classes.

For purposes of generalization one may designate such classifications 
as of a “presence-absence” type. The fourfold classification in joint 
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distribution of such variables may be portrayed in the general form of 
Table 46, wherein the letters designate the observed frequencies. The 
simplicity of this table naturally leads to simplifications of procedure in
the computation of $\chi^2$ to determine independence of the two characters.
Let a', b', c', and d' indicate the expected cell frequencies on the assumption
of independence. Then

$$a' = \frac{e \cdot g}{N}$$

and a − a' is quickly determinable. Note, however, that b − b' must
equal a − a' numerically but have the opposite sign; also c − c' must
equal b − b', and d − d' must equal a − a'. Thus only one expected
frequency need be calculated in the usual way, the other three being
written down by addition or subtraction of a − a' from the observed
frequency. This leads to a very quick computation of $\chi^2$ using the
basic formula.


TABLE 46

The Fourfold Table in General Symbols

                              Character x
                          Present    Absent    Totals
Character y   Present        a          b         e
              Absent         c          d         f
Totals                       g          h         N


Following the general rule previously given, the fourfold table has 
but one “degree of freedom.” It has just been noted that only one 
expected frequency need be determined from the array totals; the 
others are dependent on the deviation of the theoretical frequency from 
the observed in this one cell. Now the sampling distribution of $\chi^2$ when
n equals unity is very closely related to the normal curve. Indeed, if
the square root of $\chi^2$ is taken, then the sampling distribution of $\chi$ thus
secured is the positive half of the normal curve with $\chi$ equal to the relative
deviate k. Since 1.96 is the 5 per cent point in k, then 3.84 is the
5 per cent point in the $\chi^2$ distribution with but one independent element.
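The correspondence between k and $\chi^2$ for one independent element can be verified numerically. The sketch assumes only the standard complementary error function from Python's `math` module; the two function names are hypothetical.

```python
import math


def two_tailed_normal(k):
    """Probability that a relative deviate exceeds |k| in either direction."""
    return math.erfc(abs(k) / math.sqrt(2.0))


def chi2_tail_one_element(x):
    """Probability that chi-square with one independent element exceeds x."""
    return math.erfc(math.sqrt(x / 2.0))


k = 1.96
print(round(k ** 2, 2))
print(round(two_tailed_normal(k), 3), round(chi2_tail_one_element(k ** 2), 3))
```

Squaring the deviate and consulting the $\chi^2$ distribution with n = 1 gives exactly the probability found for k on the normal curve.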






The adjacent table of observed frequencies for the effect of previous
vaccination on recovery from smallpox in 2,081 cases of the disease has
been provided by Pearson.* It is not surprising to find from the computations
in Table 47 that $\chi^2$ is so large (177.3) that the probability of
independence of the two variables is negligible.


TABLE 47

Effect of Prior Vaccination on Recovery from Smallpox
(Data from K. Pearson*)

                                 Effect of smallpox
Vaccination                    Recovery      Death     Totals
Present      o                   1,562          42      1,604
             c                   1,499         105
             o - c                 +63         -63
             (o-c)²/c              2.6        37.8

Absent       o                     383          94        477
             c                     446          31
             o - c                 -63         +63
             (o-c)²/c              8.9       128.0

Totals                           1,945         136      2,081

$\chi^2$ = 177.3.   n = 1.




THE FOURFOLD TABLE WITH SMALL FREQUENCIES 

Let us assume now that Pearson had available only about one-tenth 
as many cases, distributed in essentially the same proportions. The
distribution in Table 48 is suggested. Would demonstration of the 
lack of independence of the two variables still hold? 

* Drapers’ Company Research Memoirs, Biometric Series VII. London. 1912. 
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The criterion may be applied as before, but it is worthy of note 
that, when the cell frequencies a, b, c, and d of the fourfold table are
small, a short-cut formula for $\chi^2$ may be applied with some saving of
time. The alternative formula is:

$$\chi^2 = \frac{N(ad - bc)^2}{e \cdot f \cdot g \cdot h} \qquad (3)$$

This equation is very easy to remember if the following points are noted:

(1) ad and bc are the products of the diagonally opposed cells of
the table, and

(2) the denominator is the product of the 4 array totals derived
from these cells.

TABLE 48

Short-Cut Calculation of $\chi^2$ Using Equation (3)
Imaginary data derived from Table 47

                        Effect of smallpox
Vaccination          Recovery     Death     Totals
Present                 156          4        160
Absent                   38          9         47
Totals                  194         13        207

$\chi^2$ = 324,473,328 / 18,965,440 = 17.1.   n = 1.   P = 0.00004.

It is only a matter of routine elementary algebra, which need not be
given here, to show the identity of this equation with the fundamental
one applied to a fourfold table. Using equation (3) in Table 48 because
the cell frequencies are not very large, one finds $\chi^2$ equals 17.1, a value
clearly significant of association. The answer could have been given,
however, without further calculation, for, with a given set of proportions,
$\chi^2$ varies in direct ratio with N.

$$\chi^2 = N \sum\left[\frac{\left(\dfrac{o}{N} - \dfrac{c}{N}\right)^2}{\dfrac{c}{N}}\right]$$

Had the frequencies been reduced to precisely one-tenth their previous
values, $\chi^2$ would have been reduced in like manner.
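Equation (3) and the proportionality of $\chi^2$ to N can both be illustrated briefly. The function name `fourfold_chi_square` is an assumption of the sketch; the cell frequencies are those of Tables 47 and 48, and the tenfold table is an artificial one built only to preserve the proportions exactly.

```python
def fourfold_chi_square(a, b, c, d):
    """Equation (3): N * (ad - bc)^2 / (e * f * g * h) for a fourfold table."""
    e, f = a + b, c + d   # row totals
    g, h = a + c, b + d   # column totals
    n = e + f
    return n * (a * d - b * c) ** 2 / (e * f * g * h)


small = fourfold_chi_square(156, 4, 38, 9)         # Table 48
tenfold = fourfold_chi_square(1560, 40, 380, 90)   # same proportions, N ten times larger
original = fourfold_chi_square(1562, 42, 383, 94)  # Table 47

print(round(small, 1))
print(round(tenfold / small, 6))
```

With the proportions held fixed, multiplying every frequency by ten multiplies $\chi^2$ by ten. Applied to the frequencies of Table 47 the formula gives a value slightly below the 177.3 of that table, which was built from expected frequencies rounded to whole numbers.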

This is a very useful proposition to remember. With the same number
of independent deviations in which the ratio of o to c is preserved
but N is varied, $\chi^2$ will be directly proportional to N. That is, as N
increases in such a situation, the probability of the deviations arising 
through sampling errors alone will fall. This is in accord with logical 
demands, of course. However, the point is too easily overlooked that 
concordance between an observed and a theoretical set of frequencies 
that may appear to be excellent in a geometric representation may 
nevertheless prove to be very poor by the $\chi^2$ criterion if N is very large.
And inversely, when N is small, the $\chi^2$ criterion may be quite insensitive
to a lack of concordance that, to the student of science, may appear to 
be important. In the problems of science, consideration of the impor- 
tance of discrepancies should be guided, not replaced, by contemplation of 
statistical significance tests. 

The question of a differential case fatality between the sexes in the 
Mankato typhoid epidemic of 1908 has been considered previously.¹ A
test of the significance of the difference between the two proportions
of fatality showed that the probability of the observed discrepancy
arising through sampling errors was quite high (k = 0.305, P = 0.76).
This calculation might equally well have been handled in terms of frequency
instead of proportion, using the $\chi^2$ criterion in place of the difference
in rate test there given. Table 49 presents the calculations. One
hardly needs proceed beyond securing the first expected cell frequency 
to realize that the concordance of observation with expectation on the 
theory of no sex difference is excellent. The full calculations are shown, 
however, to demonstrate the identity of the two tests as solutions to the 
same problem. Any pair of proportions or rates, derived from samples 
of specified sizes, may be tested for concordance in terms of the propor- 
tions or the frequencies, as desired. The answer will be the same, 
although the two sets of calculations seem widely different. According 
to this principle, heterogeneity in a set of more than two proportions
or rates may be tested very simply by transforming to frequencies
and using the $\chi^2$ criterion.
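The identity of the two tests can be demonstrated on the Mankato figures. The sketch assumes that the difference-in-rate test uses the pooled proportion of deaths in its standard error; under that assumption the square of k reproduces $\chi^2$ exactly. Variable names are illustrative only.

```python
import math

# Mankato typhoid epidemic, 1908 (Table 49): deaths and total cases by sex.
deaths_male, cases_male = 16, 221
deaths_female, cases_female = 19, 290

# Difference between two proportions, with a pooled proportion (assumed here).
p_male = deaths_male / cases_male
p_female = deaths_female / cases_female
pooled = (deaths_male + deaths_female) / (cases_male + cases_female)
std_error = math.sqrt(pooled * (1 - pooled) * (1 / cases_male + 1 / cases_female))
k = (p_male - p_female) / std_error

# The same data treated as a fourfold table of frequencies.
a, b = deaths_male, cases_male - deaths_male
c, d = deaths_female, cases_female - deaths_female
n = a + b + c + d
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Two-tailed probability of the deviate on the normal curve.
p_value = math.erfc(abs(k) / math.sqrt(2.0))
print(round(k, 3), round(chi2, 4), round(p_value, 2))
```

The two sets of arithmetic look different but lead to the same probability, as the text asserts.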


¹ Vide page 206.
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TABLE 49 


Test for Independence of Sex and Mortality from Disease.
Typhoid Fever Epidemic in Mankato, 1908

                                Result of disease
Sex                          Death       Survival     Total cases
Male       o                   16           205           221
           c                15.137       205.863
           o - c            +0.863        -0.863
           (o-c)²/c         0.0492        0.0036

Female     o                   19           271           290
           c                19.863       270.137
           o - c            -0.863        +0.863
           (o-c)²/c         0.0375        0.0028

Totals                         35           476           511

$\chi^2$ = 0.0931.   $\chi$ = 0.305.   n = 1.   P = 0.76.






APPENDIX I 


A TABLE OF NORMAL CURVE FUNCTIONS 

k is the relative deviate of the measure which is normally distributed.
If x is the variable, then

$$k = \frac{x - \bar{x}}{s_x}$$

P is the probability of the k magnitude being exceeded solely through
errors of random sampling.

$$P = \frac{\text{Area in tail beyond } k}{\text{Area of curve segment having same sign as } k}$$

w is the ordinate at k for a curve of unit area and unit standard deviation.

The reader may note carefully the definition above of the tabled probability.
Since the normal curve is symmetrical about k = 0, this probability is identical
with that defined by the ratio of the area in both tails, ignoring the sign of k, to the
area of the whole curve. It is in the latter form that the probability is usually
defined. The sign of k is ordinarily considered of no consequence since it may be
reversed at will. The choice between x − y and y − x may justly be regarded as a
perfectly empirical one, and as a consequence the deviation alone may be considered
important, not its sign. The simplicity of this type of reasoning has considerable
appeal, but it is not without its difficulties when the sampling distribution to be
used in testing the significance of the difference between other statistics proves to
be skew. As a result of repeated reflection on this problem in its general form, the
author prefers to consider difference problems according to either one of the patterns
given below.

Let a and b represent two statistics, the significance of the difference between
which is to be judged. Then two intimately related propositions arise, as follows:

(1) If the sign of a − b is of consequence, then one is concerned with determining
how often, among all possible differences of the same sign which will occur
through errors of random sampling alone, one as great as a − b will arise. This
will involve considering a single tail of the difference sampling distribution in
relation to the curve segment of the same sign.

(2) If the sign of a − b is logically to be ignored, then | a − b | defines a
tail in a new distribution of such quantities as itself which would arise solely
through errors of random sampling. This distribution will start at zero. Also,
if the a − b distribution is symmetrical, then the | a − b | distribution is coincident
in form with the positive half of the a − b distribution, and the probability
will be identical with that given under (1) above.

Since the normal curve is symmetrical, solutions for both propositions above become
identical in any specific instance.
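The two tabled functions can be reproduced numerically, which also shows that the one-tail ratio of proposition (1) and the two-tail form give the same P on the normal curve. The sketch assumes the usual unit normal curve; the function names are hypothetical.

```python
import math


def ordinate(k):
    """w: ordinate at k for a curve of unit area and unit standard deviation."""
    return math.exp(-k * k / 2.0) / math.sqrt(2.0 * math.pi)


def upper_tail(k):
    """Area of the unit normal curve beyond a positive deviate k."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


def probability_one_tail_ratio(k):
    """Tail beyond k divided by the half of the curve having the same sign."""
    return upper_tail(abs(k)) / 0.5


def probability_two_tails(k):
    """Both tails, ignoring sign, divided by the whole (unit) area."""
    return 2.0 * upper_tail(abs(k)) / 1.0


for k in (1.00, 1.50, 1.96):
    print(k, round(probability_one_tail_ratio(k), 4), round(ordinate(k), 3))
```

The printed values may be compared with the rows of the table for k = 1.00, 1.50, and 1.96.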
Table of Normal Curve Functions


  k       P       w        k       P       w
1.00    .3173   .242     1.50    .1336   .130
1.01    .3125   .240     1.51    .1310   .128
1.02    .3077   .237     1.52    .1285   .126
1.03    .3030   .235     1.53    .1260   .124
1.04    .2983   .232     1.54    .1236   .122
1.05    .2937   .230     1.55    .1211   .120
1.06    .2891   .227     1.56    .1188   .118
1.07    .2846   .225     1.57    .1164   .116
1.08    .2801   .223     1.58    .1141   .115
1.09    .2757   .220     1.59    .1118   .113
1.10    .2713   .218     1.60    .1096   .111
1.11    .2670   .215     1.61    .1074   .109
1.12    .2627   .213     1.62    .1052   .107
1.13    .2585   .211     1.63    .1031   .106
1.14    .2543   .208     1.64    .1010   .104
1.15    .2501   .206     1.65    .0989   .102
1.16    .2460   .204     1.66    .0969   .101
1.17    .2420   .201     1.67    .0949   .099
1.18    .2380   .199     1.68    .0930   .097
1.19    .2340   .197     1.69    .0910   .096
1.20    .2301   .194     1.70    .0891   .094
1.21    .2263   .192     1.71    .0873   .092
1.22    .2225   .190     1.72    .0854   .091
1.23    .2187   .187     1.73    .0836   .089
1.24    .2150   .185     1.74    .0819   .088
1.25    .2113   .183     1.75    .0801   .086
1.26    .2077   .180     1.76    .0784   .085
1.27    .2041   .178     1.77    .0767   .083
1.28    .2005   .176     1.78    .0751   .082
1.29    .1971   .174     1.79    .0735   .080
1.30    .1936   .171     1.80    .0719   .079
1.31    .1902   .169     1.81    .0703   .078
1.32    .1868   .167     1.82    .0688   .076
1.33    .1835   .165     1.83    .0672   .075
1.34    .1802   .163     1.84    .0658   .073
1.35    .1770   .160     1.85    .0643   .072
1.36    .1738   .158     1.86    .0629   .071
1.37    .1707   .156     1.87    .0615   .069
1.38    .1676   .154     1.88    .0601   .068
1.39    .1645   .152     1.89    .0588   .067
1.40    .1615   .150     1.90    .0574   .066
1.41    .1585   .148     1.91    .0561   .064
1.42    .1556   .146     1.92    .0549   .063
1.43    .1527   .144     1.93    .0536   .062
1.44    .1499   .141     1.94    .0524   .061
1.45    .1471   .139     1.95    .0512   .060
1.46    .1443   .137     1.96    .0500   .058
1.47    .1416   .135     1.97    .0488   .057
1.48    .1389   .133     1.98    .0477   .056
1.49    .1362   .131     1.99    .0466   .055




Table of Normal Curve Functions

  k       P       w        k       P       w
2.00    .0455   .054     2.50    .0124   .018
2.01    .0444   .053     2.51     --     .017
2.02    .0434    --      2.52     --     .017
2.03    .0424    --      2.53     --     .016
2.04    .0414    --      2.54    .0111   .016
2.05    .0404    --      2.55    .0108   .015
2.06    .0394   .048     2.56    .0106   .015
2.07    .0385    --      2.57    .0102   .015
2.08    .0375    --      2.58    .0099   .014
2.09     --     .045     2.59    .0096   .014
2.10    .0357   .044     2.60    .0093   .014
2.11    .0349   .043     2.61    .0091   .013
2.12    .0340   .042     2.62    .0088   .013
2.13    .0332   .041     2.63    .0085   .013
2.14    .0324   .040     2.64    .0083   .012
2.15    .0316   .040     2.65    .0080   .012
2.16    .0308   .039     2.66    .0078   .012
2.17    .0300   .038     2.67    .0076   .011
2.18    .0293   .037     2.68    .0074   .011
2.19    .0285   .036     2.69    .0071   .011
2.20    .0278   .035     2.70    .0069   .010
2.21    .0271   .035     2.71     --     .010
2.22    .0264   .034     2.72    .0066   .010
2.23    .0257   .033     2.73    .0063   .010
2.24    .0251   .032     2.74    .0061   .009
2.25    .0244   .032     2.75    .0060   .009
2.26    .0238   .031     2.76     --      --
2.27    .0232   .030     2.77     --      --
2.28    .0226   .030     2.78     --      --
2.29    .0220   .029     2.79    .0053   .008
2.30    .0214   .028     2.80    .0051   .008
2.31    .0209   .028     2.81     --     .008
2.32    .0203   .027     2.82    .0048    --
2.33    .0198   .026     2.83     --     .007
2.34    .0193   .026     2.84     --      --
2.35    .0188   .025     2.85     --      --
2.36    .0183   .025     2.86     --      --
2.37    .0178   .024     2.87     --      --
2.38    .0173   .023     2.88     --     .006
2.39    .0168   .023     2.89    .0039    --
2.40    .0164   .022     2.90     --      --
2.41    .0160   .022     2.91     --     .006
2.42    .0155   .021     2.92    .0035   .006
2.43    .0151   .021     2.93     --     .005
2.44    .0147   .020     2.94     --     .005
2.45    .0143   .020     2.95    .0032   .005
2.46    .0139   .019     2.96    .0031    --
2.47    .0135   .019     2.97     --     .005
2.48    .0131   .018     2.98    .0029   .005
2.49    .0128   .018     2.99    .0028   .005
APPENDIX II

TABLES OF $z_r$ AS A FUNCTION OF r

$$z_r = \tfrac{1}{2}\left[\log_e(1 + r) - \log_e(1 - r)\right] = 1.1513\left[\log_{10}(1 + r) - \log_{10}(1 - r)\right]$$

Table A. r from Zero to 0.89 by Hundredths

 r      .00     .01     .02     .03     .04     .05     .06     .07     .08     .09
.0    .0000   .0100   .0200   .0300   .0400   .0500   .0601   .0701   .0802   .0902
.1    .1003   .1104   .1206   .1307   .1409   .1511   .1614   .1717   .1820   .1923
.2    .2027   .2132   .2237   .2342   .2448   .2554   .2661   .2769   .2877   .2986
.3    .3095   .3205   .3316   .3428   .3541   .3654   .3769   .3884   .4001   .4118
.4    .4236   .4356   .4477   .4599   .4722   .4847   .4973   .5101   .5230   .5361
.5    .5493   .5627   .5763   .5901   .6042   .6184   .6328   .6475   .6625   .6777
.6    .6931   .7089   .7250   .7414   .7582   .7753   .7928   .8107   .8291   .8480
.7    .8673   .8872   .9076   .9287   .9505   .9730   .9962  1.0203  1.0454  1.0714
.8   1.0986  1.1270  1.1568  1.1881  1.2212  1.2562  1.2933  1.3331  1.3758  1.4219

Table B. r from 0.900 to 0.999 by Thousandths

 r     .000    .001    .002    .003    .004    .005    .006    .007    .008    .009
.90  1.4722  1.4775  1.4828  1.4882  1.4937  1.4992  1.5047  1.5103  1.5160  1.5217
.91  1.5275  1.5334  1.5393  1.5453  1.5513  1.5574  1.5636  1.5698  1.5762  1.5826
.92  1.5890  1.5956  1.6022  1.6089  1.6157  1.6226  1.6296  1.6366  1.6438  1.6510
.93  1.6584  1.6658  1.6734  1.6811  1.6888  1.6967  1.7047  1.7129  1.7211  1.7295
.94  1.7380  1.7467  1.7555  1.7645  1.7736  1.7828  1.7923  1.8019  1.8117  1.8216
.95  1.8318  1.8421  1.8527  1.8635  1.8745  1.8857  1.8972  1.9090  1.9210  1.9333
.96  1.9459  1.9588  1.9721  1.9857  1.9996  2.0139  2.0287  2.0439  2.0595  2.0756
.97  2.0923  2.1095  2.1273  2.1457  2.1649  2.1847  2.2054  2.2269  2.2494  2.2729
.98  2.2976  2.3235  2.3507  2.3796  2.4101  2.4427  2.4774  2.5147  2.5550  2.5987
.99  2.6467  2.6996  2.7587  2.8257  2.9031  2.9945  3.1063  3.2504  3.4534  3.8002
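Entries of Tables A and B can be regenerated from the defining formula. The sketch below is illustrative; the function names are assumptions, and the constant 1.1513 is simply half the natural logarithm of 10.

```python
import math


def z_from_r(r):
    """z_r = (1/2) * [log_e(1 + r) - log_e(1 - r)]."""
    return 0.5 * (math.log(1.0 + r) - math.log(1.0 - r))


def z_from_r_common_logs(r):
    """The same quantity from base-10 logarithms, as in the second form."""
    factor = 0.5 * math.log(10.0)   # 1.1513 to four decimals
    return factor * (math.log10(1.0 + r) - math.log10(1.0 - r))


for r in (0.10, 0.50, 0.80, 0.942, 0.977):
    print(r, round(z_from_r(r), 4))
```

The two forms agree to the limits of machine precision, so either may be used to check a doubtful entry.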
APPENDIX III


A GRAPH OF PROBABILITY LEVELS FOR THE RANDOM
SAMPLING DISTRIBUTIONS OF r WHEN ρ IS
EQUAL TO ZERO AND N IS SMALL



Examples of application 

(1) With N equal to 20, a value r = 0.55 is obtained. By locating
this point on the graph it may be seen that only once or twice in 100 trials
would a value of r as high as the one observed arise through random
sampling errors when ρ is zero.

(2) When N = 5, approximately 10 per cent of random samples from
an uncorrelated supply will yield correlation coefficients exceeding 0.8.
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APPENDIX IV 


TABLE OF THE PROBABILITY THAT $\chi^2$, DERIVED FROM n
INDEPENDENT ELEMENTS, WILL BE EXCEEDED SOLELY
THROUGH ERRORS OF RANDOM SAMPLING

Probability Integral of $\chi^2$

                  Number of independent elements, n
 χ²      1     2     3     4     5     6     7     8     9    10
  1    .317  .607  .801  .910  .963  .986  .995  .998  .999  .999
  2    .157  .368  .572  .736  .849  .920  .960  .981  .991  .996
  3    .083   --    --    --   .700  .809  .885  .934  .964  .981
  4    .046   --    --    --    --   .677  .780  .857  .911  .947
  5    .025  .082  .172  .287  .416  .544  .660  .758  .834  .891
  6    .014  .050  .112  .199  .306  .423  .540  .647  .740  .815
  7    .008  .030  .072  .136  .221  .321  .429  .537  .637  .725
  8    .005  .018  .046  .092  .156  .238  .333  .433  .534  .629
  9    .003  .011  .029  .061  .109  .174  .253  .342  .437  .532
 10    .002  .007  .019  .040  .075  .125  .189  .266  .350  .440
 11    .001  .004  .012  .027  .051  .088  .139  .202  .276  .358
 12    .001  .002  .007  .017  .035  .062  .101  .151  .213  .285
 13     **   .002  .005  .011  .023  .043   --   .112  .163  .224
 14     **   .001  .003  .007  .016   --    --   .082  .122  .173
 15     **   .001  .002  .005  .010   --    --   .059  .091  .132
 16     **    **   .001  .003  .007  .014  .025  .042  .067  .100
 17     **    **   .001  .002  .004   --    --   .030  .049  .074
 18     **    **    **   .001  .003   --    --   .021  .035  .055
 19     **    **    **   .001  .002  .004   --   .015  .025  .040
 20     **    **    **    **   .001  .003   --   .010  .018  .029
 21     **    **    **    **    --    --    --   .007  .013  .021
 22     **    **    **    **    --    --    --   .005  .009  .015
 23     **    **    **    **    --    --   .002  .003  .006  .011
 24     **    **    **    **    **   .001  .001  .002  .004  .008
 25     **    **    **    **    **    **   .001  .002  .003  .005
 26     **    **    **    **    **    **   .001  .001  .002  .004
 27     **    **    **    **    **    **    **   .001  .001  .003
 28     **    **    **    **    **    **    **    **   .001  .002
 29     **    **    **    **    **    **    **    **   .001  .001
 30     **    **    **    **    **    **    **    **    **   .001

** Less than .0005.
Probability Integral of $\chi^2$





                  Number of independent elements, n
 χ²     11    12    13    14    15    16    17    18    19    20
  1    .999  .999  .999  .999    *     *     *     *     *     *
  2    .998  .999  .999  .999  .999  .999  .999  .999    *     *
  3    .991  .996  .998  .999  .999  .999  .999  .999  .999  .999
  4    .970  .983  .991  .995  .998  .999  .999  .999  .999  .999
  5    .931  .958  .975  .986  .992  .996  .998  .999  .999  .999
  6    .873  .916  .946  .966  .980  .988  .993  .996  .998  .999
  7    .799  .858  .902  .935  .958  .973  .984  .990  .994  .997
  8    .713  .785  .844  .889  .924  .949  .967  .979  .987  .992
  9    .622  .703  .773  .831  .878  .913  .940  .960  .973  .983
 10    .530  .616  .694  .762  .820  .867  .904  .932  .953  .968
 11    .443  .529  .611  .686  .753  .809  .857  .894  .924  .946
 12    .363  .446  .528  .606  .679  .744  .800  .847  .886  .916
 13    .293  .369  .448  .527  .602  .673  .736  .792  .839  .877
 14    .233  .301  .374  .450  .526  .599  .667  .729  .784  .830
 15    .182  .241  .307  .378  .451  .525  .595  .662  .723  .776
 16    .141  .191  .249  .313  .382  .453  .524  .593  .657  .717
 17    .108  .150  .199  .256  .319  .386  .454  .523  .590  .653
 18    .082  .116  .158  .207  .263  .324  .389  .456  .522  .587
 19    .061  .089  .123  .165  .214  .269  .329  .392  .457  .522
 20    .045  .067  .095  .130  .172  .220  .274  .333  .395  .458
 21    .033  .050  .073  .102  .137  .179  .226  .279  .337  .397
 22    .024  .038  .055  .079  .108  .143  .185  .232  .284  .341
 23    .018  .028  .042  .060  .084  .114  .149  .191  .237  .289
 24    .013  .020  .031  .046  .065  .090  .119  .155  .196  .242
 25    .009  .015  .023  .035  .050  .070  .095  .125  .161  .201
 26    .006  .011  .017  .026  .038  .054  .074  .100  .130  .166
 27    .005  .008  .012  .019  .029  .041  .058  .079  .105  .135
 28    .003  .006  .009  .014  .022  .032  .045  .062  .083  .109
 29    .002  .004  .007  .010  .016  .024  .035  .048  .066  .088
 30    .002  .003  .005  .008  .012  .018  .026  .037  .052  .070

* Greater than .9995.
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APPENDIX V 

SELECTED FORMULAS 
Definitions of Statistics 


N 


Standard deviation 


-4 


- x)^ 
N 


Variance 


'Six - xy 


‘ N-l 

Maximum likelihood estimate of tr from s 

MJ 

Coefficient of variation 




C.V.,= 


100 Sx 


Relative deviate 


Correlation coefficient 


N 

N ' 


■ xy 


SxSy 
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Regression coefficient 



Rectilinear regression equation 

Y ~ ay-{- b,jX 
= {y ~ byx) + byX 


.. ^'J 1 


Sn 

y - r:,y -X\ 
L s^ j 


r-xy- 
L SiJ 


Standard error of estimate^ or standard deviation of the residual varia- 
tion about the rectilinear regression line 

— Sy — Tyy 

Coefficient of alienation 

C.A. = 

Coding and decoding equations 


if X = — — ^ a and b being s(‘lccted suitable constants, then 
0 

a; = a + 

^ - a + 


and 


= bs^ 


^ rxy — rjjY 

Moment coefficients. Quantities generating descriptions of the properties
of frequency distributions


nin 


N 


(?dh inonient coefficient) 


and ”Gamma” coeffiicients 

&i = ~ , or gi = (indices of skewness) 

mi 

h or ^2 ^ - 3 (indices of kurtosis) 

m2 ^2 

(Names are derived from the equivalent Greek letters used for the 
supply parameters.) 
$\chi^2$, a statistical measure of lack of agreement between observed frequencies
and theoretical expectancy in a complete set of n' classes, is
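defined in various ways. The general definition is

$$\chi^2 = \sum\left[\frac{(o - c)^2}{c}\right]$$

In testing independence in bivariate frequency tables, this may be rearranged to

$$\chi^2 = N\left[\sum u - 1\right], \qquad u = \frac{o^2}{T_x T_y}$$

(vide page 232)

For the fourfold table it may be shown that

$$\chi^2 = \frac{N(ad - bc)^2}{e \cdot f \cdot g \cdot h}$$

(vide Table 46, page 234)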


Frequency Distributions 

The normal curve. A frequency function defined by the equation

$$w = \frac{1}{\sqrt{2\pi}}\, e^{-k^2/2}$$

where w = the ordinate at k
      π = 3.1416 ...
      e = the natural base of logarithms, 2.7183 ..., and
      k = the relative deviate value of the variable

The binomial series

$$(p + q)^n = p^n + n p^{n-1} q + \cdots + \frac{n!}{r!\,(n - r)!}\, p^{n-r} q^r + \cdots + q^n$$

Rule for deriving the numerical coefficient of a term from that of the
preceding term:

Multiply the numerical coefficient of the known term by the power
of p in that term and divide by one more than the power of q in that
term. The result is the numerical coefficient of the following term.

When n is large, the normal curve will give a good approximation to the
binomial if np (or nq, whichever is smaller) is greater than 10.
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The Poisson series 

This may be used as an approximation to the binomial when m = np
is less than 10, n being large and p being very small. 

The Random Sampling Distributions of Statistics 

The curves of frequency distribution for statistics calculated from 
samples of size N drawn at randenn from a specified supply have the 
following parameters : 

Arithmetic mean, x̄

\[ \mu_{\bar{x}} = \mu_x \]

\[ \sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}} \]

Curve is essentially normal except when N is too small to offset anormality
in the supply.

Standard deviation, s

\[ \mu_s \doteq \sigma_x \]

\[ \sigma_s \doteq \frac{\sigma_x}{\sqrt{2N}} \]



Curve approaches normality as N increases, essentially reaching that 
form when N is about 30 or more and the supply is normally distributed. 

Correlation coefficient, r

\[ \mu_r \doteq \rho \]

\[ \sigma_r \doteq \frac{1 - \rho^2}{\sqrt{N}} \]

Curve approaches normality within only a small region of the (ρ, N)
surface. For techniques circumventing this difficulty, see Chapter 11.

Proportions, p

\[ \mu_p = \pi \]

\[ \sigma_p = \sqrt{\frac{\pi(1 - \pi)}{N}} \]

The distribution is that of the binomial $[\pi + (1 - \pi)]^N$, which is rather
closely approximated by a normal curve if Nπ is 10 or more.
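
These parameters can be checked roughly by simulation. The sketch below is an illustration under stated assumptions: the supply is taken to be normal with mean 50 and standard deviation 10, s is computed with divisor N, and the function name, seed, and number of trials are arbitrary choices.

```python
import random
from math import sqrt
from statistics import fmean, pstdev


def simulate_sampling_distribution(n, supply_mean, supply_sd, trials, seed=0):
    """Draw repeated random samples of size n and summarise their means and s values."""
    rng = random.Random(seed)
    means, sds = [], []
    for _ in range(trials):
        sample = [rng.gauss(supply_mean, supply_sd) for _ in range(n)]
        means.append(fmean(sample))
        sds.append(pstdev(sample))   # divisor N (an assumption about how s is defined)
    return {
        "mean_of_means": fmean(means),
        "sd_of_means": pstdev(means),
        "mean_of_sds": fmean(sds),
        "sd_of_sds": pstdev(sds),
    }


result = simulate_sampling_distribution(n=25, supply_mean=50.0, supply_sd=10.0, trials=4000)
print(result)
print("theoretical sd of means:", 10.0 / sqrt(25))
print("theoretical sd of s:    ", 10.0 / sqrt(2 * 25))
```

The simulated standard deviations should fall close to the theoretical values, within the limits of random sampling.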





Standard errors are estimates of the unknown true standard deviations
of the random sampling distributions of statistics. They are commonly
formed by replacing the unknown parameter with the “best estimate” of
it, formulated from the sample. For large samples (N > 30) the following
formulas are commonly employed:


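\[ \text{S.E.}_{\bar{x}} = \frac{s_x}{\sqrt{N}} \qquad \text{S.E.}_s = \frac{s_x}{\sqrt{2N}} \]

\[ \text{S.E.}_r = \frac{1 - r^2}{\sqrt{N}} \qquad \text{S.E.}_p = \sqrt{\frac{p(1 - p)}{N}} \]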
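
The large-sample standard errors can be written as four small functions. This is a sketch only; the function names are illustrative, and s, r, and p are the sample values substituted for the unknown parameters.

```python
from math import sqrt


def se_mean(s, n):
    """Standard error of the arithmetic mean."""
    return s / sqrt(n)


def se_standard_deviation(s, n):
    """Standard error of the standard deviation."""
    return s / sqrt(2 * n)


def se_correlation(r, n):
    """Standard error of the correlation coefficient."""
    return (1 - r * r) / sqrt(n)


def se_proportion(p, n):
    """Standard error of a proportion."""
    return sqrt(p * (1 - p) / n)


print(se_mean(10.0, 100))
print(se_standard_deviation(10.0, 50))
print(se_correlation(0.5, 100))
print(se_proportion(0.5, 100))
```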


Tests of Significance 

(1) The deviation of a mean

\[ k = \frac{\bar{x} - \mu}{\text{S.E.}_{\bar{x}}} \]

(Probability of k from Appendix I)

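(2) The difference between the means of two independent samples

\[ k = \frac{\bar{x} - \bar{y}}{\sqrt{\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2}} \doteq \frac{\bar{x} - \bar{y}}{\sqrt{\text{S.E.}_{\bar{x}}^2 + \text{S.E.}_{\bar{y}}^2}} \]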


(3) Standard deviations and proportions. Analogous tests follow the
above procedures. Special techniques are required in small-sample work.

(4) The deviation of r from zero

(i) N = 20 or less

\[ t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}} \]

(Probability of t from Appendix III)

(ii) N > 20

\[ z_r = \tfrac{1}{2}\,[\log_e(1 + r) - \log_e(1 - r)] \qquad \text{(See Appendix II)} \]

\[ k = z_r \sqrt{N - 3} \]

(Probability of k from Appendix I)

(5) The difference between the correlation coefficients of two independent
samples

\[ k = \frac{z_{r_1} - z_{r_2}}{\sqrt{\dfrac{1}{N_1 - 3} + \dfrac{1}{N_2 - 3}}} \]

(Probability of k from Appendix I)
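
The tests for means and correlation coefficients can be sketched as follows. The function names are illustrative. The t statistic is given in its usual form with N - 2 under the root, and the divisor for the difference between two z values is the usual large-sample one; both are assumptions where the printed formulas are not fully legible. The probability is taken from the normal curve as a two-sided value, which may not match the layout of the table in Appendix I.

```python
from math import erfc, log, sqrt


def k_for_mean(xbar, mu, s, n):
    """Test (1): deviation of a sample mean from a hypothetical supply mean."""
    return (xbar - mu) / (s / sqrt(n))


def k_for_mean_difference(xbar, s_x, n_x, ybar, s_y, n_y):
    """Test (2): difference between the means of two independent samples."""
    se_x = s_x / sqrt(n_x)
    se_y = s_y / sqrt(n_y)
    return (xbar - ybar) / sqrt(se_x ** 2 + se_y ** 2)


def t_for_r(r, n):
    """Test (4)(i): small-sample test of r against zero (usual form, assumed)."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


def z_of_r(r):
    """Transformation of r used in tests (4)(ii) and (5)."""
    return 0.5 * (log(1 + r) - log(1 - r))


def k_for_r(r, n):
    """Test (4)(ii): large-sample test of r against zero."""
    return z_of_r(r) * sqrt(n - 3)


def k_for_r_difference(r1, n1, r2, n2):
    """Test (5): difference between two independent correlation coefficients."""
    return (z_of_r(r1) - z_of_r(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))


def two_sided_probability(k):
    """Normal-curve probability of a deviate at least as large as |k| in either direction."""
    return erfc(abs(k) / sqrt(2))


k = k_for_mean(52.0, 50.0, 10.0, 100)
print(k, two_sided_probability(k))
print(k_for_r(0.5, 28))
```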




INDEX 


Abscissas, axis of, 25 
Adjusted rates, 193-196

Age at marriage, England and Wales, 1932, 29-30 
Age distribution, Minnesota, 1930, 21 
Array, 229 

Assortative mating for stature, 85-90, 109 
Attributes, 4, 183 
Average, see Mean 

Bar diagram, 26-27 

Basal metabolism and weight, 96-98, 116-117 
Beta coefficients (and b statistics), 72-75, 249

Binomial series, 172, 250 
beta and gamma coefficients of, 178 
derivation of, 167-173 
mean and standard deviation of, 176-178 
Binomial series and the normal curve, 175-180, 205-206
Biological variation, 34-35 
Biometric method, 6 
Birge, 13, 223 
Birth rate, 199 

Bivariate distribution, 84-127, 227-238 

Brain weight, 28 

Broad classes, 18-19, 230-231 

Calcium intake and pyorrhea, 227-231
Calculations, verification of, ix 
Cards, record, 23-24 
sorting, 22-25 
Case fatality rate, 187-189 
Causation, 84, 108 

Centering values of frequency curves, 41-43 
Central tendency, 39 
Chapin, 189 

Children per family, New Zealand, 1916, 16, 26-27 
“Chi squared,” 210-238, 246-247, 250
proportionality to N of, 236-237 
Class centers, 55-56, 67-68 

Classification, 15-25, 26-32, 43-44, 55-58, 66-67, 85-87, 190-192, 222-224, 227-238 
errors of, 2-3, 18, 103-104 
Class range, true and apparent, 26-22 
Coding and code scales, ix, 42-49, 56-57, 249
Coefficient of alienation, 125-127, 249 
Coefficient of correlation, see Correlation coefficient 
Coefficient of regression, see Regression coefficient 
Coefficient of reversion, 95 
Coefficient of variation, 64, 248 
Combinations, 168-173 
Confidence intervals or limits, 133-135, 138 
Continuous variables, 15 
Contour lines, 88-90, 94, 110, 113, 122, 164 
Corrected rates, 193-196 
Correlated means, 146-148 
Correlation, absence of, 86-87 
perfect rectilinear, 94-95, 106-107
Correlation coefficient, 84-107, 248
calculation of, 96-103 
derivation of, 91-95 
sampling errors of, 152-164, 251, 253 
Correlation scale, 91, 106-107 
Correlation surface, portrayal of, 85-90 
Correlation table, 85, 99, 118, 227-238 
summations from, 100-101 
Corresponding largeness, 92 
Cotton seedling survival, 174-175, 219-220 
Cumulative frequency, 62-63 
Curve fitting, 74-75 

Davis and Worley, 125 
Decoding, 46, 57, 102, 249 
“Degrees of freedom,” 216, 234 
Deming, 216, 218 
Deming and Birge, 138 
De Moivre, 76 

Description, subjective and objective, 1-3 
Deviates, 52-54, 60-61 
Diastatic activity and gas pressure, 125 
Dice, Weldon’s experiment with, 220-222 

Differences, between correlation coefficients, sampling errors of, 160-161, 253 
between means, sampling errors of, 138-147, 252 
between proportions, sampling errors of, 206-209 
between standard deviations, sampling errors of, 144 
Diphtheria, 203-205 
Discrete variables, 15, 20, 174 

Elderton, 75, 214 

Equilibrium, 66-67 

Equivalent scores, 81-83, 118-119 

Erroneous inference, 149-151 

Error of estimate, 124-125, 249

Erythrocyte count, 38, 55 

Examination scores, 14, 62-63, 82-83, 118-119 

Experimental design, 146-148 
Finger length, 14, 18-19, 132
Fisher, 61, 138, 157-160, 163, 215-216 
Force of incidence of events, 184-185 
Fourfold table, 233-238, 250 
Frequency, 13 

Frequency curves, 12-15, 30-35, 41-42, 66-75 
illustrations of, 13, 14, 31, 33, 34, 73, 74, 79, 82, 122, 132, 134, 136, 140, 154, 167,
160, 179, 217, 218

Frequency discordance, measurement of, 210-238 
Frequency distribution, law of, 12-35 

Galton, v, 12, 50, 69, 76, 84, 91, 109, 119
Gamma coefficients (and g statistics), 249 
Gas pressure and diastatic activity, 125 
Gauss, 12, 138, 216 
Geissler, 173, 203, 215-216 

Generalization from samples, 8, 10-11, 128-164, 200-209 
Gibbs, 5 

Graphical representation of data, 25-30
Grouping, 15-25, 43-44, 230-231 
corrections for, 18, 103-104 
irregular, 20-21, 49, 191 
unit of, 30-32, 43-45 
Grouping effects, 103 

Haden, 38 

Harris, vii, ix, 1, 54, 96 

Harris and Benedict, 96 

Head breadth, 17 

Health, 2-3, 183-199 

Height, 2-3; see Assortative mating

Histograms, 27-31 

illustrations of, 13, 14, 30, 31, 62, 82, 132
solid, 88 

Homoscedasticity and heteroscedasticity, 121, 124 

Independence, test of, for bivariate tables, 227-238

Independence defined, 169-170 

Independent deviations, 60-61, 215-216, 219-224, 229 

Individual differences, 51-52 

Infant mortality rate, 199 

Information, amount of, 10-11, 60

Intelligence, 7 

Isograms, 88 

Kurtosis, 33-34, 67, 71-74, 178
Laplace, 12 

“Least squares,” 5^60, 114 

Light, width of a spectral band of, 13, 81, 223-224 
Macdonell, 17, 18, 19 
Maternal mortality rate, 187, 199 
Maximum likelihood estimate of σ, 61, 248
Maynard, 16, 16, 27 

Mean, arithmetic, 36-37, 42, 52-54, 58-59, 67, 248
computation of, 44-49 
geometric, 37 
harmonic, 37 
Mean deviation, 59 
Mean squared deviate, 53, 59 

Means, sampling errors of, 50, 131-133, 222-223, 251, 252 
Measurement, 4, 7 
increment of, 17-18, 30 
precision in, 5, 6, 16, 20-22 
Median, 38-39, 42, 59 
Mendel, 210 
Mendelian ratios, 227 
Modal class or value, 39-40, 41 
Mode, 41-43 

Moment coefficients, 53, 68-72, 249 
Moments, 66-75 
Morbidity rate, 199 
Mortality rates, 185-199 

Neonatal mortality rate, 199

Normal curve, 12-15, 34, 72-74, 76-83, 121-125, 175-180, 223-234, 250 
tables of, 80-81, 239-243 
Normal surface, equation for, 104
regressions for, 110-116 
Null hypothesis, 139 
Number, language of, 3-4 
Number of classes, 18, 44, 49 
Numerical description, 1-11 

Observation, errors of, 6 
Observations, number of, 6, 7 
Ordinates, axis of, 25 

Ordinates of the normal curve, 77-78, 239-243 
Overminuteness, 6 

Parameters, 130 
Partial association, 91 
Peakedness, 33-34
Pearl, 184 

Pearson, 9, 41, 54, 69, 71, 72, 74-75, 76, 95, 210, 213-216 

Pearson and Lee, 85 

Permutations, 168 

Poisson series, 180-182, 251 

Population, 7-8, 128-129 

Prediction index, 126-127 
Probabilities of sampling errors, 128-164, 200-209
Probability and the binomial series, 165-182, 200-209 
Probability and the normal curve, 78-81, 239-243 
Probability defined, 165 
Probability of death, 185-199 
Probable error, 80 
Product moment, 102 
Proportions, 165-199 
sampling errors of, 200-209, 251 
Protein, errors of determination, 145 
Pyorrhea and calcium intake, 227-231 
Pyorrhea and vitamin C intake, 232-233 
high values of, 225-226 


Quartiles, 61-64, 80 
Quetelet, 4, 12, 76 


Radusch, 227 
Random numbers, 129 
Random sample, 8, 129 

Random sampling, errors of, 8, 76, 128-164, 200-209, 251 
Range of variation, 50-51 
Ranking, 38 
Rate scales, 189-190 
Rates, 166, 184-199 
Reasoning, inductive and deductive, 8 
Red blood cell count, 38, 55 
Regression, coefficient of, 115-116, 249 
curvilinear, 106-107 
rectilinear, 108-119, 249 

Regression equation, calculation of rectilinear, 116-118 
derivation of rectilinear, 112-115
Regression lines, empirical and theoretical, 109
Relative deviates, 77, 248 
correlation and, 92-93 
equivalent scores and, 82-83 
regression and, 112-114 
the normal curve and, 77-81, 239-243 
Relative variation, 64, 125-127 
Residual variation, 108, 120-127 
Retzius, 28 


Sample and the supply, 7, 128-129 
Sample, size of, 10 
Scarlet fever, 188-189 
Scatter diagram, 85-86, 116, 125 
Science, task of, 1 
Semi-interquartile range, 61
Seriation, 15-25, 39, 43-44, 87, 98-104 
Sex ratio, 173 

Sheppard’s corrections, 103-104 
Significance, levels of, 150 
Significant differentiation, 148-149
Skewness, 32-33, 42, 67, 69-72
Small samples, viii, 40, 138, 192, 204 
significant correlation from, 161-164, 245 
Soper, 157 
Species, 4 

Specific rates, cause, 197-199 
group or class, 190-196 
Spurious interpretation, 127 
Squared deviations, 53, 58-59, 114
Standard deviation, 54-60, 67-68, 71, 77, 248 
sampling errors of, 135-136, 251 
Standard errors, 136-138, 142, 147, 202-203, 252 
Standard scales, 77, 79, 166 
Statistical method, 5, 6 
Statistics, 40, 128-131 
biased and unbiased, 132, 137-138, 202 
desirable qualities of, 40-41
sampling errors of, 128-164, 200-209, 214-219 
Stillbirth rate, 199 
Student, 162, 181-182 
Summation, 37, 46-47, 55-56, 100-101 
Supply, 7-8, 128-129 
Symbolic analysis, 9 40 
Symmetry, 32-33, 42, 67, 69-72 


Tally stroke, 22 
Temporal rates, 188-189 
Tippett, 129 

Transformation of scale, 159-160, 244; see Coding, Relative deviates, Rates
Typhoid fever, 200-201, 205-208 
Typical values, 36-49 

Universe, 7 


Variables, 10 

dependent and independent, 111-112 
Variance, 60-61, 248 
Variates, 10 
Variation, 4-6 
amount of, 50-65, 67 
causes of, 5 
residual, 108, 120-127 
type of, 26-35, 65, 66-75 
Vital statistics, 166, 183-199
Vitamin C intake and pyorrhea, 232-233 

Walker, 4 

Weight and basal metabolism, 96-98, 116-117 
Weight of new-born infants, 30-31, 57-58 
Weldon, 69, 220-222 







